DSpace at EWHA: Machine learning and deep learning-based models for predicting overweight and obesity in Korean adolescents

Title: Machine learning and deep learning-based models for predicting overweight and obesity in Korean adolescents

Authors: 이세림

Issue Date: 2023

Department/Major: Department of Social Welfare, Graduate School

Publisher: The Graduate School, Ewha Womans University

Degree: Doctor

Advisors: 전종설

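Abstract: Overweight and obesity among adolescents are reported as one of the most serious threats worldwide, including in South Korea. They warrant attention because they are likely to persist into adulthood and adversely affect psychological and physical health as well as social outcomes. At the same time, overweight and obesity are complex problems shaped by a combination of individual and social factors, which makes them difficult to predict and prevent. Machine learning and deep learning techniques make it possible to identify these multifaceted risks more accurately and reliably, offering new insights. This study therefore developed models to predict overweight and obesity in Korean adolescents using 11 machine learning and deep learning techniques: Logistic regression, Ridge, LASSO, Elasticnet, Decision tree, Bagging, Random forest, AdaBoost, XGBoost, Support vector machine, and Fully connected layer models. The study comprehensively covered 71 independent variables that may influence adolescent overweight and obesity across multiple domains: sociodemographic characteristics, lifestyle habits, physical health, psychological health, behavioral problems, family factors, and peer and school factors. It also analyzed the feature importance of the variables and, on that basis, sought to analyze and discuss specific implications for social work. Data on a total of 43,268 respondents from the 16th Korean Youth Risk Behavior Web-based Survey were used. The dataset was split into an 80% training set and a 20% test set. Each machine learning and deep learning model was validated with 5-fold cross-validation, and predictive performance was compared and evaluated using accuracy, recall, precision, F1 score, and AUC score. The study used Python, an open-source programming language, together with a range of Python libraries including Pandas, NumPy, Statsmodels, SciPy, Matplotlib, Scikit-learn, and TensorFlow. All of the general machine learning and deep learning models outperformed the Logistic regression model in predicting overweight and obesity in Korean adolescents. For example, on the test set Logistic regression achieved an accuracy of 0.7662, recall of 0.0251, precision of 0.5312, F1 score of 0.0480, and AUC score of 0.6892, whereas XGBoost achieved an accuracy of 0.8403, recall of 0.6351, precision of 0.6497, F1 score of 0.6423, and AUC score of 0.8982. With the exception of Elasticnet and SVM, the machine learning and deep learning models, including Ridge, LASSO, Bagging, Random forest, AdaBoost, XGBoost, and the Fully connected layer, showed comparably strong predictive performance, with no large differences among them. The feature importance analyses carried out with these methods yielded insights into the importance of the various domains related to overweight and obesity in Korean adolescents. These domains comprise sociodemographic characteristics, lifestyle habits, physical health, psychological health, behavioral problems, the family domain, and the peer and school domain. Within sociodemographic characteristics, individual factors (gender and age) and residence (city size) were important for predicting adolescent overweight and obesity. For lifestyle habits, diet (fast-food and water consumption), physical activity (time spent sitting on weekdays), and weight control and body image (monthly weight control and perceived body image) were relatively important. For physical health, the machine learning models identified oral health (frequency of tooth brushing) and subjective health (self-perceived health) as important predictors.
The analysis also showed that psychological (mental) health, including anxiety disorders and subjective level of happiness, plays an important role in predicting overweight and obesity in Korean adolescents. Among behavioral problems, smoking (lifetime e-cigarette use, age at first e-cigarette use, ease of purchasing cigarettes, and second-hand smoke exposure at school), smartphone addiction (weekday smartphone use time and experience of smartphone dependence), and sexual behavior (experience and method of contraception) emerged as important variables. In the family domain, family background (nationality of father and mother) was an important variable. In the peer and school domain, school violence (experience of hospital treatment due to school violence) stood out as an important variable. By enabling a more multifaceted understanding of the risk factors for adolescent obesity through several machine learning and deep learning techniques, the study lays a foundation for identifying at-risk groups early and preventing harm, and it offers strategies and implications from a social work perspective. The findings underscore the need for tailored, integrated prevention programs that take these multifaceted characteristics into account in addressing overweight and obesity among Korean adolescents. In particular, such programs should target a range of domains, including sociodemographic characteristics, lifestyle habits, physical activity, physical health, psychological health, behavioral problems, family, and peers and school. Newly identified important factors, such as oral health, smoking, sexual behavior, school violence, and parental nationality, need to be addressed comprehensively. The models developed in this study can be widely used in practice settings that aim to predict overweight and obesity. The study provides important insight for selecting the most effective approach to predicting overweight and obesity in Korean adolescents. These findings and discussions are expected to contribute to the advancement of predictive modeling for overweight and obesity in Korean adolescents and to promote informed decision-making and the development of tailored prevention in the field of adolescent obesity.
Overweight and obesity in adolescents have been reported as one of the most serious threats worldwide, including in South Korea. Adolescent overweight and obesity require special attention because they can persist into adulthood and negatively affect psychological and physical health and social outcomes. Meanwhile, overweight and obesity are complex issues influenced by a combination of individual and societal factors, making them difficult to predict and prevent. Machine learning and deep learning techniques have enabled researchers to detect the multifaceted risk of overweight and obesity more accurately and reliably, providing novel insights. Therefore, this study developed predictive models for overweight and obesity in Korean adolescents using 11 machine learning and deep learning techniques: Logistic regression, Ridge, LASSO, Elasticnet, Decision tree, Bagging, Random forest, AdaBoost, XGBoost, Support vector machine, and Fully connected layer models.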
The study comprehensively covered a diverse set of 71 factors that could influence overweight and obesity in adolescents across multiple domains, including sociodemographic characteristics, dietary habits, physical health, psychological health, behavioral problems, family factors, and peer and school factors. In addition, the study analyzed the feature importance of the variables and sought to discuss the specific social work implications that arise from these findings. A total of 43,268 records from the 16th Korean Youth Risk Behavior Web-based Survey were included in the study. The dataset was divided into an 80% training set and a 20% test set. To assess the models' performance, the study employed 5-fold cross-validation and evaluated several metrics, including accuracy, recall, precision, F1 score, and AUC score. The study was implemented in Python, an open-source programming language, using a range of libraries: Pandas, NumPy, Statsmodels, SciPy, Matplotlib, Scikit-learn, and TensorFlow. The machine learning and deep learning algorithms performed markedly better than Logistic regression in predicting overweight and obesity in Korean adolescents. For instance, Logistic regression achieved an accuracy of 0.7662, recall of 0.0251, precision of 0.5312, F1 score of 0.0480, and an AUC score of 0.6892 in the test set. By contrast, XGBoost achieved an accuracy of 0.8403, recall of 0.6351, precision of 0.6497, F1 score of 0.6423, and an AUC score of 0.8982 in the test set. Notably, machine learning and deep learning models, including Ridge, LASSO, Bagging, Random forest, AdaBoost, XGBoost, and Fully connected layer, demonstrated comparably strong predictive performance for overweight and obesity among Korean adolescents, with no substantial disparities observed.
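As a quick consistency check on the figures quoted above, the following minimal sketch recomputes the F1 score as the harmonic mean of precision and recall. The helper names `f1_from` and `f1_gap` and the `reported` dictionary are illustrative and do not come from the thesis; only the numeric values are taken from the abstract.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Test-set values as quoted in the abstract (four decimal places).
reported = {
    "Logistic regression": {
        "accuracy": 0.7662, "recall": 0.0251, "precision": 0.5312, "f1": 0.0480,
    },
    "XGBoost": {
        "accuracy": 0.8403, "recall": 0.6351, "precision": 0.6497, "f1": 0.6423,
    },
}


def f1_gap(model_name: str) -> float:
    """Absolute difference between the recomputed and the reported F1."""
    m = reported[model_name]
    return abs(f1_from(m["precision"], m["recall"]) - m["f1"])


for name, m in reported.items():
    recomputed = f1_from(m["precision"], m["recall"])
    print(f"{name}: recomputed F1 = {recomputed:.4f}, reported F1 = {m['f1']:.4f}")
```

The recomputed values agree with the reported ones to within the rounding of the four-decimal inputs. The Logistic regression row also illustrates why accuracy alone can mislead: an accuracy of 0.7662 coexists with a recall of 0.0251, which means that only about 2.5% of the actual overweight or obese cases in the test set were identified.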
The feature importance analyses, employing diverse machine learning and deep learning methods, yielded insightful findings regarding the significance of various domains in relation to overweight and obesity among Korean adolescents. These domains encompassed sociodemographic characteristics, dietary habits, physical health, psychological health, behavioral problems, family, and school. Specifically, within the sociodemographic characteristics domain, individual factors (i.e., gender and age) and residence (i.e., city size) were found to be important in predicting adolescent overweight and obesity. Concerning dietary habits, diet (i.e., consumption of fast food and consumption of water), physical activity (i.e., number of hours spent sitting on weekdays), and weight control and body image (i.e., monthly weight control and perceptions of body image) exhibited relative importance. Regarding physical health, the machine learning models identified oral health (i.e., frequency of tooth brushing) and subjective health (i.e., subjective health perception) as important features for prediction. Furthermore, the analysis revealed that psychological health (mental health), including anxiety disorders and subjective level of happiness, played an important role in prediction. In terms of behavioral problems, the analysis identified several important variables, including smoking (i.e., lifetime use of e-cigarettes, age at first e-cigarette use, ease of access to the purchase of cigarettes, and second-hand smoking at school), smartphone addiction (i.e., number of hours on the smartphone on weekdays and experience of overdependence on the smartphone), and sexual behavior (i.e., experience and method of birth control). Within the family domain, the study found that family background (i.e., nationality of father and mother) was an important feature.
In the peer and school domain, the analysis highlighted school violence (i.e., experience of hospital treatment due to school violence) as an important variable. The current study's findings emphasize the critical need for integrated and customized prevention programs that consider these multifaceted features to prevent overweight and obesity among Korean adolescents. Specifically, it is recommended that these prevention programs target various domains, including sociodemographic characteristics, dietary habits, psychological health, behavioral problems, family, peer, and school domains. Notably, it is crucial to consider together the newly emerged important features, such as oral health, smoking, smartphone addiction, sexual behavior, family background, and school violence. Further, the machine learning and deep learning models developed in this study to predict overweight and obesity in Korean adolescents could be used widely in social work practice settings that aim to predict these health conditions. Thus, this study presents valuable insights into selecting the most effective approach for predicting overweight and obesity in this specific population. The findings and discussions contribute to the advancement of predictive modeling in the context of overweight and obesity among Korean adolescents, facilitating informed decision-making and the development of targeted prevention in the field of adolescent obesity.
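The abstract does not state how feature importance was computed for each model. As one model-agnostic possibility, and purely as an assumption, permutation importance measures how much a model's accuracy drops when a single feature column is shuffled. The sketch below uses only the Python standard library; the synthetic data, `toy_model`, and all function names are hypothetical and are not taken from the thesis or its survey data.

```python
import random
from typing import Callable, List, Sequence

Row = List[float]


def accuracy(predict: Callable[[Row], int], X: Sequence[Row], y: Sequence[int]) -> float:
    """Fraction of rows whose predicted label equals the true label."""
    correct = sum(1 for row, label in zip(X, y) if predict(row) == label)
    return correct / len(y)


def permutation_importance(
    predict: Callable[[Row], int],
    X: Sequence[Row],
    y: Sequence[int],
    n_repeats: int = 10,
    seed: int = 0,
) -> List[float]:
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving every other column untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical toy data: feature 0 determines the label, feature 1 is noise.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(500)]
y = [1 if row[0] > 0.5 else 0 for row in X]


def toy_model(row: Row) -> int:
    """Stand-in for a fitted classifier; it looks at feature 0 only."""
    return 1 if row[0] > 0.5 else 0


scores = permutation_importance(toy_model, X, y)
print([round(s, 3) for s in scores])
```

In this toy setting the first feature receives a large importance and the second receives exactly zero, because the stand-in model never reads it. With a real fitted model, correlated features can share or mask importance, so rankings of this kind are best read as indicative rather than causal.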