DSpace at EWHA: Image Generation of Hazardous Situations in Construction Sites for DNN Model Training using Text-to-Image Synthesis

Title: Image Generation of Hazardous Situations in Construction Sites for DNN Model Training using Text-to-Image Synthesis

Authors: 김하영

Issue Date: 2023

Department/Major: Graduate School, Department of Architectural and Urban Systems Engineering

Publisher: The Graduate School, Ewha Womans University

Degree: Master

Advisors: 이준성

Abstract: Deep learning-based computer vision technologies are actively used to prevent construction accidents by automatically monitoring construction sites. To identify potentially hazardous situations, a computer must first recognize accident-related objects and then recognize the complex relationships between them. However, the characteristics of the construction industry make it difficult to obtain sufficient training image data. In particular, because construction accidents often involve a combination of various objects, and a deep neural network (DNN) model uses surrounding background information when recognizing objects, it is necessary to secure training data that reflects relationship information, such as the location and arrangement of accident-related objects. Therefore, this study aims to 1) use a text-to-image generative model to generate virtual images of situations with a high likelihood of accidents in which multiple objects are present and 2) validate the usability of the generated images as training data for DNN models. Text-to-image models offer high flexibility and creativity because they can generate an unlimited number of images from textual information. However, few cases have validated the potential of such models. If the data synthesized by these models can be used effectively as training data for DNN models, it can serve as a valuable tool for addressing the scarcity of image data in the construction industry. This study generated images of hazardous situations that simultaneously depict workers and other objects, which are the major accident-causing elements. To this end, nine cases of hazardous situations were constructed by analyzing the information to be entered into the text-to-image model. Subsequently, the optimal prompt templates were determined so that entering this information systematically would generate the desired images more effectively.
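As a result, a total of 3,585 virtual images related to roofs, ladders, and scaffolds were generated by inputting the analyzed prompts into the text-to-image model. Finally, the workers and objects were annotated with segmentation masks to build a training dataset. When three DNN models trained solely on the generated virtual images were tested with real images, they achieved an average accuracy of approximately 62% based on mAP@0.5 in object detection and instance segmentation. From this, the author found that the models could recognize workers and objects at a certain level, and object recognition accuracy is expected to improve further as the number of training images increases. In particular, segmentation performance was observed to vary with the relative size and degree of overlap of the objects, confirming the usefulness of images in which several objects appear simultaneously. Therefore, virtual images generated by a text-to-image model depicting workers and objects can be effectively utilized as training data for DNN models in construction safety. This study holds academic significance in that it contributes to addressing the scarcity of training data in the construction safety field by proposing the use of a text-to-image model to generate virtual images. Future research should extend beyond object recognition to infer interactions among objects and gain a deeper understanding of hazardous situations in construction sites.

Abstract (translated from the Korean): Recently, deep learning-based computer vision technologies have been actively used to proactively prevent construction accidents by automatically monitoring construction sites. For a computer to recognize potentially hazardous situations on a construction site, recognition must extend from the accident-related objects themselves to the complex relationships between those objects. Achieving accurate object recognition first requires training a deep learning model on a large number of images, but the characteristics of construction sites make obtaining training data very difficult and limited. In particular, because construction accidents arise from a combination of various objects and deep learning models use surrounding information when recognizing objects, it is necessary to secure training data that reflects relationship information such as the location and arrangement of accident-related objects.
Accordingly, this study aims to 1) use a text-to-image generative model to generate virtual images of situations with a high likelihood of accidents in which multiple objects appear and 2) verify whether these images can be used as training data for deep learning models. Text-to-image generative models offer a high degree of flexibility and creativity in that they can generate an unlimited number of target images from textual information alone, but few cases have verified this potential. If data synthesized by such a model is shown to be effective as training data for deep learning models, it can serve as a valuable tool for addressing the shortage of image data in the construction industry. This study set workers and other accident-causing objects, the main accident-causing elements, as the recognition targets of the deep learning models and generated images of hazardous situations that depict them simultaneously. First, the information to be entered into the text-to-image generative model, that is, the prompt content, was analyzed to construct a total of nine hazardous situations as input information. Next, the optimal prompt template was determined so that entering this information systematically would generate images closer to the target. As a result, approximately 1,200 virtual images were generated for each accident-causing object (roof, ladder, scaffold) by entering the derived information into the text-to-image generative model. The workers and accident-causing objects in the images were then labeled with segmentation masks to build a training dataset. When a total of three deep learning models, one per accident-causing object, trained solely on the virtual images generated in this study were tested with real images, they achieved an average accuracy of about 62% based on mAP@0.5 in object detection and instance segmentation. This confirmed that the models can recognize workers and accident-causing objects at a certain level, and object recognition accuracy is expected to improve further if the number of training images is increased. In particular, object segmentation performance was observed to vary with the relative size and degree of overlap of the objects, confirming the usefulness of images in which workers and accident-causing objects appear together. The study therefore concludes that virtual images generated by a text-to-image generative model in which workers and accident-causing objects appear together can be used effectively as training data in the construction safety field. This study holds academic significance in that it contributes to resolving the shortage of training data in the construction safety field by proposing a method of generating virtual images with a text-to-image generative model. Future work should go beyond object recognition and infer interactions between objects in order to understand hazardous construction situations in greater depth.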
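
The abstract reports accuracy as mAP@0.5. As a rough illustration of what the 0.5 refers to, the sketch below counts a predicted box as correct when its intersection over union (IoU) with an unmatched ground-truth box is at least 0.5. It is not the thesis's evaluation code: the function names, the box format, and the sample coordinates are assumptions made for this example. It also shows only the matching criterion. A full mAP computation additionally averages precision over recall levels and over classes, and instance segmentation would compare masks rather than boxes.

```python
from typing import List, Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall_at(
    preds: List[Tuple[Box, float]], truths: List[Box], thr: float = 0.5
) -> Tuple[float, float]:
    """Greedily match predictions (highest score first) to ground truth.

    A prediction is a true positive when its best IoU with a not-yet-matched
    ground-truth box is at least `thr`. Returns (precision, recall).
    """
    matched = set()
    true_positives = 0
    for box, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(truths):
            if j in matched:
                continue
            value = iou(box, gt)
            if value > best_iou:
                best_j, best_iou = j, value
        if best_j >= 0 and best_iou >= thr:
            matched.add(best_j)
            true_positives += 1
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(truths) if truths else 0.0
    return precision, recall


# Hypothetical data: two ground-truth objects and three scored predictions.
truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [((1, 1, 10, 10), 0.9), ((20, 20, 26, 30), 0.8), ((50, 50, 60, 60), 0.7)]
precision, recall = precision_recall_at(preds, truths, thr=0.5)
```

With this sample data, the first two predictions overlap their ground-truth boxes with IoU 0.81 and 0.6 and count as correct, while the third overlaps nothing, so two of three predictions are correct and both ground-truth objects are found.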