DSpace at EWHA: A Study on Building a Spam Mail Classification and Prediction Model Using Data Mining Techniques

Title: A Study on Building a Spam Mail Classification and Prediction Model Using Data Mining Techniques (데이터마이닝 技法을 활용한 스팸메일 分類 및 豫測 모형 構築에 關한 硏究)

Authors: 안수산

Issue Date: 2000

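Department/Major: Graduate School of Business Administration, Business Administration major

Publisher: Ewha Womans University, Graduate School of Business Administration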

Degree: Master

Abstract (translated from the Korean): In the corporate environment, email has brought major changes to workflows inside and outside the company. Email offers many advantages, such as overcoming the limits of the physical workplace and maximizing internal communication. However, the emergence of spam mail, which has recently become a social problem, sets a heavy cost against these advantages. Spam mail is advertising email sent indiscriminately to email users who did not ask for it; it is also called bulk mail, junk mail, or unsolicited mail. Spam causes users considerable stress, places excessive load on servers while it is being sent and received, and monopolizes network resources, which are public in nature, without paying anything for them. This study builds a model for classifying and predicting spam mail using artificial neural networks and decision trees, two data mining techniques that are actively applied to classification tasks. Many solutions for blocking spam mail have been developed so far, but they have had many limitations in practice. The significance of this study is that, by building a spam detection model that uses the words appearing in emails as variables, it presents a methodology that can be applied in addition to existing anti-spam solutions.

Abstract (original English): Concern about the proliferation of unsolicited bulk email, commonly referred to as "spam", has been steadily increasing. When received in small quantities, spam may annoy recipients but rarely poses a significant problem. However, some recipients of large quantities of spam find themselves so overwhelmed with unwanted email that it is time-consuming or difficult for them to ferret out their desired correspondence. As spam recipients become increasingly annoyed, ISPs have been deluged with complaints. Furthermore, some ISPs report that spam places a considerable burden on their systems. A variety of technical countermeasures to spam have been proposed: the simplest are already being implemented; some of the more extreme could require dramatic changes to the ways we communicate electronically. In addition, there has been growing support in Korea and the U.S. for laws that would restrict the sending of spam. Two major factors contribute to the problem: the low price of bulk email and cheap pseudonyms. Bulk email is inexpensive to send, and pseudonyms are inexpensive to obtain. Serious bulk mailers invest a few hundred dollars in specialized software capable of sending 250,000 messages with forged headers per hour and of harvesting email addresses from Usenet, the Web, and online services. Many kinds of anti-spam solutions have been applied in various ways. There is growing concern that the volume of spam sent each day may increase substantially and that bulk mailers may adopt increasingly sophisticated techniques to thwart automated filtering tools. Yet we still receive more spam every day because filtering tools have limitations.

Data mining is the exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules. The goal of data mining is to allow a corporation to improve its marketing, sales, and customer support operations through a better understanding of its customers. Technologies used for data mining include neural networks, decision trees, genetic algorithms, case-based reasoning, and association rules. In this thesis we extract knowledge from a large body of email data using two data mining techniques, neural networks and decision trees. Our experimental results show that the decision tree achieves relatively better predictive accuracy and comprehensiveness. We believe this thesis makes a modest contribution to blocking spam mail if its approach is applied in addition to existing anti-spam solutions. Further research is needed to minimize classification errors.
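
The approach named in the abstract, a decision tree whose input variables are the words that appear in an email, can be illustrated with a minimal sketch. The abstract does not state which tree algorithm, features, or data set the thesis used, so everything below is an assumption made for illustration: the splitting rule is ID3-style information gain, the features are simple word-presence flags, the six training messages are invented toy data, and all function names (`tokenize`, `best_word`, `build_tree`, `classify`) are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase an email body and return its set of words (presence features)."""
    return set(text.lower().split())

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_word(rows):
    """Return the word whose presence/absence split has the highest information gain."""
    base = entropy([label for _, label in rows])
    vocab = sorted(set().union(*(words for words, _ in rows)))
    best, best_gain = None, 0.0
    for word in vocab:
        present = [label for words, label in rows if word in words]
        absent = [label for words, label in rows if word not in words]
        if not present or not absent:
            continue  # this word does not split the data
        remainder = (len(present) * entropy(present)
                     + len(absent) * entropy(absent)) / len(rows)
        gain = base - remainder
        if gain > best_gain + 1e-12:
            best, best_gain = word, gain
    return best

def build_tree(rows, depth=0, max_depth=3):
    """Build a tree: a leaf is a label string, a node is (word, if_present, if_absent)."""
    labels = [label for _, label in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth == max_depth:
        return majority
    word = best_word(rows)
    if word is None:
        return majority
    present = [row for row in rows if word in row[0]]
    absent = [row for row in rows if word not in row[0]]
    return (word,
            build_tree(present, depth + 1, max_depth),
            build_tree(absent, depth + 1, max_depth))

def classify(tree, text):
    """Walk the tree using the words of a new message and return the leaf label."""
    words = tokenize(text)
    while isinstance(tree, tuple):
        word, if_present, if_absent = tree
        tree = if_present if word in words else if_absent
    return tree

# Invented toy data, not taken from the thesis.
TRAIN = [
    ("win free money now", "spam"),
    ("free offer click now", "spam"),
    ("cheap loans free quote", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the attached report", "ham"),
    ("lunch on monday", "ham"),
]

tree = build_tree([(tokenize(text), label) for text, label in TRAIN])
print(tree)
print(classify(tree, "free money inside"))
print(classify(tree, "agenda for the review"))
```

On this toy data the word "free" separates the two classes completely, so the tree has a single split. A tree of this form can be read directly as a rule ("if the message contains this word, then spam"), which is one reason decision trees are often easier to interpret than neural networks.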