DSpace at EWHA: Recognition and Classification of Korean Fonts based on CNN for Digital Forensics

Title: CNN-Based Korean Font Recognition and Classification for Digital Forensics in Document Examination

Other Titles: Recognition and Classification of Korean Fonts based on CNN for Digital Forensics

Authors: 고운

Issue Date: 2019

Department/Major: Graduate School, Department of Computer Science and Engineering

Publisher: The Graduate School, Ewha Womans University

Degree: Master

Advisors: 조동섭

Abstract: This study aims to improve the efficiency of electronic document examination, a task that requires professional digital forensics expertise, and the reliability of its results. To identify the Korean font used in an electronic document, it uses a convolutional neural network (CNN), a deep learning technique, to learn the features that distinguish Korean fonts automatically and then to recognize and classify the fonts. Twelve Korean fonts that appear frequently in official and private documents were selected, and a data set of about 230,000 Korean font images was built to train and validate the classification model. The practicality of the trained model was then checked on a test set prepared in the same form as actual document-examination evidence. The proposed CNN model is a simplified VGGNet with batch normalization. In testing, it classified Korean fonts with 97.0% accuracy, and it achieved a recognition rate of 88.2% on Korean font images extracted from documents that were printed and then scanned, a form not seen during training. These results indicate that the proposed method can be used for Korean font recognition. In addition, analysis of the feature maps of the convolutional layers and of the node activations of the fully connected layers indicates that the model distinguishes Korean fonts mainly by the ends and the bends of character strokes. The proposed model can already support actual electronic document examination, but becoming an optimal model for the many Korean fonts currently in use will require a larger data set and further improvement of the model. Building on this work, adding an enhancement algorithm for classifying a large number of fonts would allow the model to serve as a software module applicable to a variety of electronic document analysis tasks.
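The abstract names batch normalization as the technique applied to the simplified VGGNet but does not give the layer configuration. The following is therefore only a minimal sketch of the standard batch normalization step for a single feature, not the thesis implementation. The function name `batch_norm`, the parameters `gamma`, `beta`, and `eps`, and the sample activations are all assumptions made for illustration.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift.

    Hypothetical helper: gamma and beta stand in for the learnable
    scale and shift parameters; eps avoids division by zero.
    """
    n = len(batch)
    mean = sum(batch) / n
    # Biased (population) variance over the mini-batch.
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


# Hypothetical activations of one feature for a mini-batch of four images.
activations = [0.5, 2.0, 3.5, 6.0]
normalized = batch_norm(activations)

# After normalization the feature has roughly zero mean and unit variance.
norm_mean = sum(normalized) / len(normalized)
norm_var = sum((x - norm_mean) ** 2 for x in normalized) / len(normalized)
```

In a convolutional network this step would typically be applied per channel after a convolution, which keeps the activation scale stable during training.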