DSpace at EWHA: A Study on the Predictive Model of Amount of Calls to take a Call Taxi in Gwangju

Title: A Study on the Predictive Model of Amount of Calls to take a Call Taxi in Gwangju

Authors: 박현지

Issue Date: 2020

Department/Major: Graduate School, Department of Statistics

Publisher: Ewha Womans University Graduate School

Degree: Master

Advisors: 유재근

Abstract: This study uses public data provided by SKT Big Data Hub to fit models that predict the volume of calls placed to call-taxi services in Gwangju, and uses those models to identify the important explanatory variables. The methods compared are linear regression, best subset regression, ridge, lasso, principal component regression (PCR), partial least squares (PLS), random forest, support vector machine (SVM), gradient boosting, and a neural network. When the models were fitted to the data for the whole of Gwangju, gradient boosting performed far better than the other models. In that model, variable importance ranked, from highest to lowest, as Gu (district), Year, Hourf (time of day), Month, Temp (temperature), Day, Pm10 (fine dust level), Rain (precipitation), and Wind (wind speed). Partial dependence plots for the most important variables show the following. For Gu, call volume is highest in Gwangsan-gu and lowest in Dong-gu. For Year, call volume decreases over time. For Hourf, call volume rises as the value increases. For Month, there are more calls at the beginning of the year than at the end. For Temp, call volume increases when the temperature is low. Because Gu was the most important variable, I also fitted the models separately for each district. In Seo-gu and Buk-gu, gradient boosting far outperformed the other models, while random forest performed best in Gwangsan-gu, lasso in Dong-gu, and ridge in Nam-gu. That a different final model was selected for each district confirms that Gu is more important than the other variables.
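
The workflow summarized in the abstract (fit a gradient boosting model, rank the explanatory variables by importance, then read a partial dependence curve) can be illustrated with a minimal sketch in plain Python. This is not the thesis's code, and the thesis's software, loss function, and importance measure are not stated in this record. Everything below is an assumption made for illustration: the data are synthetic with invented effect sizes, the features `gu`, `hourf`, and `temp` only echo the abstract's variable names, the district is coded as an arbitrary integer, and the helpers `fit_stump`, `fit_gbm`, `predict`, and `partial_dependence` are hypothetical. The sketch uses depth-one trees with squared-error loss and measures importance as accumulated squared-error reduction.

```python
import random

def fit_stump(X, r):
    """Find the one-feature split that most reduces the squared error of residuals r."""
    n, total = len(r), sum(r)
    best = (0.0, None, None, 0.0, 0.0)  # gain, feature, threshold, left mean, right mean
    for j in range(len(X[0])):
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = 0.0
        for k in range(1, n):
            left_sum += r[order[k - 1]]
            lo, hi = X[order[k - 1]][j], X[order[k]][j]
            if lo == hi:
                continue  # no split point between equal values
            right_sum = total - left_sum
            # Squared-error reduction of two means versus a single mean
            gain = left_sum ** 2 / k + right_sum ** 2 / (n - k) - total ** 2 / n
            if gain > best[0]:
                best = (gain, j, lo, left_sum / k, right_sum / (n - k))
    return best

def fit_gbm(X, y, n_rounds=100, lr=0.1):
    """Gradient boosting for squared error: each stump is fitted to the current residuals."""
    f0 = sum(y) / len(y)
    pred = [f0] * len(y)
    stumps, importance = [], [0.0] * len(X[0])
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        gain, j, thr, left, right = fit_stump(X, resid)
        if j is None:
            break  # no split improves the fit
        stumps.append((j, thr, lr * left, lr * right))
        importance[j] += gain  # accumulate the stump's unshrunk gain per feature
        pred = [p + (lr * left if x[j] <= thr else lr * right) for p, x in zip(pred, X)]
    return f0, stumps, importance

def predict(model, x):
    f0, stumps, _ = model
    return f0 + sum(lv if x[j] <= thr else rv for j, thr, lv, rv in stumps)

def partial_dependence(model, X, j, grid):
    """Average prediction over the data with feature j forced to each grid value."""
    return [sum(predict(model, x[:j] + [v] + x[j + 1:]) for x in X) / len(X) for v in grid]

# Synthetic data with invented effects; NOT the SKT Big Data Hub data.
random.seed(0)
names = ["gu", "hourf", "temp"]
X, y = [], []
for _ in range(300):
    gu, hourf = random.randrange(5), random.randrange(24)
    temp = random.uniform(-5.0, 30.0)
    X.append([float(gu), float(hourf), temp])
    y.append(10.0 * gu + 0.5 * hourf - 0.1 * temp + random.gauss(0.0, 1.0))

model = fit_gbm(X, y)
importance = model[2]
pd_gu = partial_dependence(model, X, 0, [0.0, 1.0, 2.0, 3.0, 4.0])
for name, score in sorted(zip(names, importance), key=lambda t: -t[1]):
    print(name, round(score, 1))
```

In this toy setup the ranking simply reflects the effect sizes written into the synthetic target, so it says nothing about the Gwangju results. For real data, an established implementation with tuned tree depth, learning rate, and a held-out evaluation set would be the more sensible choice.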