DSpace at EWHA: 트윗에서 추출한 스트레스 감성과 토픽의 공간적 특성 연구

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	강영옥	-
dc.contributor.author	강애띠	-
dc.creator	강애띠	-
dc.date.accessioned	2016-08-26T04:08:16Z	-
dc.date.available	2016-08-26T04:08:16Z	-
dc.date.issued	2016	-
dc.identifier.other	OAK-000000122755	-
dc.identifier.uri	https://dspace.ewha.ac.kr/handle/2015.oak/214292	-
dc.identifier.uri	http://dcollection.ewha.ac.kr/jsp/common/DcLoOrgPer.jsp?sItemId=000000122755	-
dc.description.abstract	This thesis addresses several themes in order to identify the spatial characteristics of the stress sentiment and topics extracted from tweets posted by Twitter users. First, text mining is conducted to extract stress sentiment and stress topics from the collected tweet data. Second, Twitter users’ home locations are inferred so that the extracted stress sentiment and topics can be mapped to appropriate locations in space. Third, the stress sentiment and topics are visualized spatially to identify regional differences. “Sentiment analysis” and “topic modeling” are used to extract sentiment and topics from the Twitter dataset. The lasso method is applied to resolve the overfitting problem that arises in big data analysis. In this research, stress sentiment is defined by the stress responses that Twitter users express in their tweets. These responses are divided into two types: negative expressions about a stressful situation, such as stress building up, and positive expressions, such as relieving stress by doing something. As a result of the sentiment analysis, each tweet was assigned a stress sentiment score and each word extracted by morphological analysis was assigned a sentiment coefficient, which together established a stress sentiment dictionary. Using the LDA algorithm, 15 topics associated with Twitter users’ stress are extracted. Except for “commercials for stress” and “relief of another’s stress”, the topics are classified into three themes: cause, result, and resolving method. The causes of stress include “personality”, “learning”, “job”, “family”, and “SNS use”. The results of stress include “illness”, “mental status”, and “hair loss”. The resolving methods include “drawing”, “gaming”, “exercise”, “nutrition”, and “music”. 
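In the second stage, which infers Twitter users’ home locations, a daily movement pattern model and a daily activity field model are used to build a logistic regression model. With the resulting discriminant, 34 times more geolocated tweets are obtained than by relying on geotags alone. A database framework is built to manage the stress tweet data and to analyze it by stress sentiment score, topic, and user home location. Differences in stress sentiment and topics are confirmed at the regional (Korean Sido) scale. Where the stress sentiment score of a topic was not proportional to the number of tweets, the cause of the mismatch was traced to word frequency. Thus, characterizing a region by the distribution of the sentiment scores of its words is more appropriate than relying on the share of a specific topic alone. In total, 6 relationships are extracted from the analysis of topic-based stress sentiment scores, and the time series of some of these relationships are explored. There are positive relationships between the “family” and “learning” topics and between the “hair loss” and “illness” topics. There are negative relationships between the “learning” and “nutrition” topics, the “hair loss” and “illness” topics, and the “drawing” and “exercise” topics. This study is limited by the small number of words analyzed, but it extends stress analysis to geographic trends. In future work, other SNS datasets can be used to map stress sentiment and topic models. To improve the model’s accuracy, machine-learning-based methods can be incorporated into the analysis of topic-based stress sentiment scores, and a time-series prediction model can be explored further.;To identify the spatial characteristics of the stress sentiment and topics that Twitter users express in their tweets, this thesis proceeded in the following stages. The first stage is text mining to extract stress sentiment and stress topics from tweet data. The second stage infers Twitter users’ home regions in order to find a valid spatial link for mapping the stress sentiment and topics extracted in the first stage. The third stage visualizes and confirms the regional differences in stress sentiment and topics. 
To extract stress sentiment and topics from the stress tweet data, this study applied “sentiment analysis” and “topic modeling”, both sub-methods of text mining, and applied the lasso technique to resolve the model overfitting problem that arises in big data analysis. In this study, stress sentiment refers to the accumulation response, in which negative expressions about stress predominate, while the relief response, in which positive expressions about relieving stress predominate, is defined as the positive stress response. Through the stress sentiment analysis, the collected tweets were assigned sentiment scores, and a stress sentiment dictionary was compiled from the words that make up the tweet texts. A stress topic is the subject of what Twitter users express about stress; in this study, 15 topics were extracted by applying the LDA algorithm. Except for the topics “stress-related advertisements and news” and “consoling others’ stress”, the topics could be classified into three themes: causes of stress, results, and relief methods. The cause theme included the topics “personality”, “learning”, “job”, “family”, and “SNS use”, and the result theme included the topics “illness”, “psychological state”, and “scalp and hair loss”. The relief-method theme included the topics “drawing”, “gaming”, “exercise and cultural activities”, “food intake”, and “singing, etc.”. In the second stage, which addresses missing location information in SNS data, a machine-learning-based logistic regression model was built using SNS users’ daily movement and regional awareness patterns. Using this model, a discriminant was constructed that gives the probability that the region where a Twitter user generated the most tweets, and the region mentioned most often in those tweets, is the user’s home region. Applying this discriminant to infer the home regions of users who expressed stress yielded 34 times more location-attributed tweets than using geotagged data. In addition, a stress tweet database was designed so that regional differences can be analyzed continuously on the basis of the stress sentiment scores, topics, and user home regions extracted from tweets, laying the groundwork for ongoing collection and analysis of stress tweet data. The stress sentiment and topics extracted from tweets were analyzed from multiple angles at the Sido level, confirming that the topics and sentiment concerning stress differ by Sido. Where the relationship between the topic-level stress sentiment score and the number of tweets was not proportional, the reason was explored through the frequency of words used in each Sido, and the sentiment scores of the words used in each Sido were found to affect that Sido’s stress sentiment score. It was therefore confirmed that characterizing a Sido through the distribution of the stress sentiment scores of the words it contains is more valid than judging it by the share of a particular topic alone. Deriving inter-topic correlations from the topic-level stress sentiment scores yielded a total of 8 significant correlations, and the monthly stress sentiment scores of these topics were compared to see whether the correlations also appear in the time series. Among the 5 relationships showing positive correlation, the “family” and “learning” topics show nearly identical stress sentiment scores over time and thus appear to follow a similar temporal pattern. The remaining positively correlated topic pairs, however, showed inverse rather than similar patterns, confirming that their positive relationship does not hold over time. By contrast, the negatively correlated topic pairs clearly showed negative relationships in the time-series graphs as well, confirming that negative correlations appear similarly in the time dimension. This thesis is limited by the accuracy of the home-region inference methodology and by the narrow range of words used, but it is meaningful in that it identified people’s feelings about the social issue of stress, the ways they express those feelings, and their regional differences, and in that it extended the field of spatial data by mapping an invisible, emotional phenomenon. Building on this thesis, the following directions for development can be pursued. 
The stress sentiment and topic models can be extended to SNS data other than tweets, the accuracy of the SNS user home-region inference model will be improved, and a machine-learning-based morphological analysis methodology will be added to raise the reliability of the stress sentiment scores and topic extraction. In addition, the inter-topic correlations based on topic-level stress sentiment scores need to be extended to time series and developed into a predictive model that increases the usefulness of stress sentiment and topics as social signals.	-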
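dc.description.tableofcontents	I. Introduction 1 A. Research background 1 B. Research objectives 2 C. Research design 3 II. Related research and analysis techniques 5 A. Spatiotemporal exploration of social issues extracted from SNS data 5 1. Research trends in exploring the spatial distribution of social issues using SNS data 5 2. Research on the association between SNS data and regional factors 10 B. Research on techniques for inferring SNS users' home regions 12 1. Home-region inference using user relationship networks 12 2. Home-region inference through content analysis of SNS text 13 3. Home-region inference using users' daily life patterns 14 C. Methods for analyzing unstructured SNS text (text mining) 16 1. Information extraction methods in text mining 16 2. Information analysis methods in text mining 18 III. Stress sentiment and topics extracted from tweet data 21 A. Overview of sentiment and topic analysis on stress 21 1. Research process and design for extracting stress sentiment and topics 21 2. Concept and structure of stress 23 B. Data collection and preprocessing 28 1. Data collection process 28 2. Preprocessing 30 C. Extraction of stress sentiment and topics using text mining 32 1. Stress content characteristics of tweet text 32 2. Stress sentiment classification of tweet text 37 3. Topic classification of tweet text 48 D. Summary 60 IV. Inferring the home regions of Twitter users who expressed stress 62 A. Need for a methodology to infer Twitter users' home regions 62 B. Building the home-region inference model for Twitter users 64 1. Preparation for model building 64 2. Model building 70 3. Model application 81 4. Results 84 V. Regional differences in stress sentiment and topics extracted from tweets 88 A. Building and using a database to explore regional differences 88 1. Specification of the stress tweet database 88 2. Use of the stress tweet database 92 B. Regional differences in the number of users and tweets by stress topic 94 1. Calculating the number of users and tweets by stress topic 94 2. Comparison of preferred stress topics in each Sido 95 C. Regional differences in sentiment scores by stress topic 103 1. Calculating each Sido's sentiment scores by stress topic and word usage frequency 103 2. Analysis of differences in topic-level stress sentiment scores through word usage frequency 105 3. Deriving inter-topic correlations using sentiment scores by stress topic 113 VI. Summary and conclusion 117 References 124 Appendix 131 ABSTRACT 181	-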
dc.format	application/pdf	-
dc.format.extent	7680411 bytes	-
dc.language	kor	-
dc.publisher	Ewha Womans University Graduate School	-
dc.subject.ddc	300	-
dc.title	트윗에서 추출한 스트레스 감성과 토픽의 공간적 특성 연구	-
dc.type	Doctoral Thesis	-
dc.title.translated	A study on regional characteristics on the stress sentiment and topics extracted from tweet data	-
dc.creator.othername	Kang, Ae Tti	-
dc.format.page	xi, 183 p.	-
dc.description.localremark	박079	-
dc.contributor.examiner	이영민	-
dc.contributor.examiner	강영옥	-
dc.contributor.examiner	이건학	-
dc.contributor.examiner	이종원	-
dc.contributor.examiner	홍일영	-
dc.identifier.thesisdegree	Doctor	-
dc.identifier.major	Graduate School, Department of Social Studies Education	-
dc.date.awarded	2016. 2	-