DSpace at EWHA: A comparison study of random survival forests for competing risks analysis with rare events



Title: A comparison study of random survival forests for competing risks analysis with rare events

Authors: 이수현

Issue Date: 2022

Department/Major: Graduate School, Department of Statistics

Publisher: Ewha Womans University Graduate School

Degree: Master

Advisors: 이동환

Abstract: As societies around the world age rapidly and new diseases such as coronavirus disease (COVID-19) emerge, interest in survival has grown and statistical research on medical data has become increasingly active. Survival analysis, which estimates the time until death, is a representative approach, and its best-known model is the Cox proportional hazards model (Cox PHM). The Cox PHM, however, requires the proportional hazards assumption, performs poorly on datasets with a high censoring rate, and predicts poorly when two or more causes affect death, that is, when the causes stand in a competing risks relationship with one another. We therefore seek a model with high prediction accuracy in the presence of competing risks. Among the machine learning methods that have come into wide use with the advent of big data, we focus on Random Survival Forests, which combine the random forest model with survival analysis. This model does not rely on specific assumptions or distributions, uses bootstrap sampling and randomization, and shows high accuracy on datasets with large sample sizes, so we expect it to improve the prediction of survival probabilities when the data structure is complex.
We compare Random Survival Forests with the Cause-Specific Hazard and Fine and Gray models, which are traditionally used in competing risks analysis. We evaluate each model under various simulation settings, such as adjusting the censoring rate and generating nonlinear variables, and we apply the models to real data to investigate the strengths of Random Survival Forests.
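
The kind of simulation setting mentioned in the abstract (an adjustable censoring rate and a nonlinear covariate effect) can be illustrated with a minimal sketch. This is not the thesis's actual simulation design, which the abstract does not specify: the two-cause setup, the exponential event times, the hazard formulas, the coefficients, and the names `simulate_competing_risks`, `censored_fraction`, and `censor_hazard` are all assumptions made here for illustration. Only the Python standard library is used.

```python
import math
import random


def simulate_competing_risks(n, censor_hazard, seed=0):
    """Draw n subjects with two competing causes and independent censoring.

    Returns a list of (time, status, x1, x2) tuples, where status is
    0 = censored, 1 = event from cause 1, 2 = event from cause 2.
    All distributions and coefficients are hypothetical.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = rng.uniform(-1.0, 1.0)
        # Constant cause-specific hazards. Cause 1 depends on x2 through a
        # quadratic (nonlinear) term, which a linear Cox-type model would miss.
        h1 = 0.5 * math.exp(0.5 * x1 + 1.0 * x2 ** 2)
        h2 = 0.3 * math.exp(-0.5 * x1)
        t1 = rng.expovariate(h1)             # latent time to cause 1
        t2 = rng.expovariate(h2)             # latent time to cause 2
        c = rng.expovariate(censor_hazard)   # independent censoring time
        time = min(t1, t2, c)
        if time == c:
            status = 0
        elif time == t1:
            status = 1
        else:
            status = 2
        rows.append((time, status, x1, x2))
    return rows


def censored_fraction(rows):
    """Share of subjects whose observed time is a censoring time."""
    return sum(1 for row in rows if row[1] == 0) / len(rows)


if __name__ == "__main__":
    # A larger censoring hazard yields a higher censoring rate.
    for hazard in (0.2, 1.0, 2.0):
        data = simulate_competing_risks(2000, hazard, seed=1)
        print(hazard, round(censored_fraction(data), 3))
```

In this sketch the censoring rate is controlled indirectly through `censor_hazard`; to hit a target rate, one would tune that value until `censored_fraction` matches. The resulting (time, status, covariates) records have the shape that competing risks models such as those compared in the thesis take as input.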