DSpace at EWHA: Logistic Regression Analysis of Cause-of-Death Diseases in 2008

Title: Logistic Regression Analysis of Cause-of-Death Diseases in 2008

Authors: 신우영

Issue Date: 2010

Department/Major: Graduate School, Department of Statistics

Publisher: Ewha Womans University Graduate School

Degree: Master

Advisors: 오만숙

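Abstract: As interest in health has grown, information about common diseases has become more widely available. As life expectancy rises, research aimed at preparing for death from disease is also active, because a close look at the cases in which people die of a given disease reveals, conversely, the factors that make people less likely to contract it. Factors that affect disease can be environmental or genetic. Among the many environmental factors, this study examines how basic background information, rather than deliberately cultivated dietary or lifestyle habits, affects the cause of death.

This study uses logistic regression to examine causes of death from disease. The data were collected in 2008 by the national statistical office (통계청) to investigate causes of death and contain 10 variables in total. The analysis itself used seven variables: cause of death, address at death, sex, occupation of the deceased, marital status, education level, and age at death. Because the data are based on the disease classification cause-of-death codes, diseases are divided mainly by body part. Grouping the diseases into 17 broad categories made it possible to examine less familiar diseases as well.

Before the main analysis, independence tests were run to check whether the variables were associated, in order to determine whether including them in the regression model was meaningful. The characteristics and frequencies of the variables were also examined, and goodness-of-fit tests were used to check whether the variables were suitable for the model. Once suitable variables were identified, logistic regression models were built with them. Because a logistic regression model has a binary response, an indicator variable was defined for each cause of death and a model was fitted to it. The variables judged suitable in the goodness-of-fit tests were entered into the models to find categories with common features and to see which circumstances are associated with a higher chance of dying from each disease.

Because address has many categories, common regional characteristics could be identified. It was also possible to identify which sex has a higher probability of death for each disease, along with common characteristics for occupation, marital status, and education level. For age at death, the only numeric variable in the data, the analysis distinguished diseases with a higher probability of death at a young age from those for which the probability of death rises with age, so trends by age could be analyzed.

The results of this thesis show what characteristics each case of death from disease has. Because causes of death from disease were grouped by broad characteristics, readers can examine the tendencies of the diseases that interest them. The thesis can be used to anticipate whether there are common regional tendencies, whether there are differences by sex, which categories of occupation and marital status have a large effect, and whether trends by age at death can be identified.

Average human life expectancy has increased in recent years; for Koreans it was 79.1 years in 2008. People are also paying increasing attention to healthy living and want more information about disease. They want to know which factors cause disease in order to avoid fatal illness. Because many factors affect disease, this paper looks for the crucial ones and also analyzes how people's backgrounds affect the diseases they die of.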
One way to learn how diseases arise is to identify the factors common to a group. The data used in this research come from the national statistical office of Korea, under the title ‘Causes of Death in 2008’. The data set contains 247,757 records and 10 variables. The cause of death is the dependent variable; address, sex, occupation, marital status, education level, and age of the deceased are the independent variables. The causes of death are classified into 17 disease categories. First, goodness-of-fit tests are conducted for the explanatory variables, and then logistic regression models are fitted, using each of the 17 cause-of-death categories in turn as an indicator variable. The goodness-of-fit tests show an association between each independent variable and the dependent variable. The independent variables that passed the significance tests were selected for the regression models, and logistic regression models were built with them. The models were built with first-order main effects only and were then checked with goodness-of-fit tests. Because the main-effects-only models showed no problems, second-order interactions did not need to be considered. Analyzing the logistic regression models reveals the background characteristics of those who died in 2008. In conclusion, the results suggest that each disease has a distinctive background profile and that particular groups are more likely to die of particular diseases.
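
The indicator-variable approach described in the abstract (one binary logistic regression per cause-of-death category) can be sketched roughly as follows. This is only an illustration under stated assumptions: the records are synthetic toy values, the labels `cause_A`, `cause_B`, `cause_C` and the function names are hypothetical, a single predictor (age) stands in for the thesis's several explanatory variables, and the fit uses plain gradient ascent rather than whatever software the author used, which the record does not state.

```python
import math

def make_indicator(causes, target):
    """Return 1 where the cause equals `target`, else 0 (one-vs-rest coding)."""
    return [1 if c == target else 0 for c in causes]

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * (x - mean(x))) by gradient ascent
    on the mean log-likelihood. Centering x only shifts the intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    centered = [x - mean_x for x in xs]
    b0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(centered, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p          # score for the intercept
            g1 += (y - p) * x    # score for the slope
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Synthetic toy data (NOT the thesis data): age in decades and a
# hypothetical cause category for each of 12 made-up records.
ages = [1, 2, 3, 5, 4, 6, 7, 8, 9, 2, 5, 8]
causes = ["cause_A"] * 4 + ["cause_B"] * 5 + ["cause_C"] * 3

# One binary model per category: the category's indicator is the response.
slopes = {}
for target in sorted(set(causes)):
    y = make_indicator(causes, target)
    _, slopes[target] = fit_logistic(ages, y)

for target, slope in slopes.items():
    print(target, "age slope (log-odds per decade):", round(slope, 3))
```

A negative age slope for a category means the odds of that cause, relative to all other causes, fall with age in the toy data; a positive slope means they rise. This mirrors the abstract's distinction between diseases more likely at a young age and those more likely at an older age.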