TY - THES
AU - 盧鉉京
DA - 2005
UR - http://dspace.ewha.ac.kr/handle/2015.oak/172298
UR - http://dcollection.ewha.ac.kr/jsp/common/DcLoOrgPer.jsp?sItemId=000000010493
AB - Data mining has become popular and is used in many applied fields. A data miner looks for something that is not intuitive: the further the information is from being obvious, the more value it potentially has. The new information must also be valid. If data miners look hard enough in a large collection of data, they are bound to find something of interest, but it must be legitimate and correct. If the process is over-optimized (meaning the results have moved beyond the desired accuracy) or if the results are coincidental (meaning they occurred merely by chance), this should be revealed in output analysis after the process has completed. In most real samples, missing values are common. Consequently, it is difficult to select the best model, and suitable treatment is needed at the first stage to avoid this problem. The main focus of this thesis is how to handle missing values in some variables in order to develop a compact model with good predictability.
First, we discuss efficient imputation methods. Then we discuss how to select a compact logistic regression model from a data mart with incomplete observations. We construct two data sets: a whole data set with missing values imputed by class averages, and a sub data set containing only the observations without missing values. We then develop reasonable logistic regression models based on each data set. Based on fit statistics and lift charts, we recommend the best of the four candidate models. Our strategy is illustrated through a case study.
LA - eng
PB - Ewha Womans University Graduate School
KW - model selection
KW - logistic regression
KW - data mining
TI - A case study for improving misclassification rates using compact logistic regression
ER -