DSpace at EWHA: A Study of Fast String Matching Algorithms

Title: A Study of Fast String Matching Algorithms

Authors: 김경미

Issue Date: 1980

Department/Major: Graduate School, Department of Mathematics

Keywords: string matching; algorithms; mathematics

Publisher: Graduate School, Ewha Womans University (이화여자대학교 대학원)

Degree: Master

Abstract: This thesis describes fast string-matching algorithms that search for all occurrences of a given string, the "pattern", within another string, the "text". Let m be the length of the pattern and n the length of the text. Let k denote the location of the first occurrence of the pattern in the text, and let q be the number of distinct characters in the pattern. Assume that the text is read from an external file. The first method, discovered by Knuth, Morris, and Pratt, matches the pattern from left to right and requires O(m+n) time units and O(m) locations of internal memory in the worst case. The second method, found by Boyer and Moore, matches the pattern from right to left and requires O(k+m) time units and O(m+q) locations for its tables in the worst case. For a random English pattern of length 5, the expected number of text characters the second method actually inspects before finding a match at k is less than (k+5)/3, whereas the first method inspects all k characters. We might therefore expect the second method to be faster than the first. There are, however, several situations in which the second algorithm may not be advisable. The obvious one is when the expected penetration k at which the pattern is found is small: the time spent processing the pattern is then significant, and one might consider using the first method instead.

Abstract (translated from Korean): Two methods are studied for quickly finding a given string, the "pattern", in an input "text" string. The first method matches from the left end of the pattern toward the right and was devised by Knuth, Morris, and Pratt. The second method starts at the right end of the pattern and matches toward the left and was devised by Boyer and Moore. Let m be the length of the pattern, n the length of the text, k the position at which the whole pattern is first found in the text, and q the number of distinct characters that appear in the pattern. In the worst case, the first method requires O(m+n) time units and O(m) storage, and the second method requires O(k+m) time units and O(m+q) storage. The expected number of characters that must be compared against the text to find the pattern was also computed probabilistically. For the second method, when the pattern consists of random English letters and has length 5, this expectation is (k+5)/3. Compared with k for the first method, the second method can be expected to be faster. When k is small, however, using the first method should be considered.
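
The contrast drawn in the abstract can be illustrated with a minimal Python sketch that counts how many distinct text positions each search direction reads before the first match. This is not the thesis's own code: the function names (`kmp_failure`, `kmp_first`, `horspool_first`) are hypothetical, and the right-to-left search is Horspool's simplification, which uses only a last-character shift table, rather than the full Boyer-Moore algorithm. Both functions assume a non-empty pattern.

```python
def kmp_failure(pattern):
    """fail[i] = length of the longest proper border of pattern[:i + 1]."""
    fail = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        while j > 0 and pattern[i] != pattern[j]:
            j = fail[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        fail[i] = j
    return fail


def kmp_first(pattern, text):
    """Left-to-right search (Knuth-Morris-Pratt).

    Returns (index of first match or -1, number of text positions read).
    Every text position up to the end of the match is read once.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    fail = kmp_failure(pattern)
    j = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while j > 0 and ch != pattern[j]:
            j = fail[j - 1]
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            return i - j + 1, i + 1
    return -1, len(text)


def horspool_first(pattern, text):
    """Right-to-left search (Horspool's simplification, NOT full Boyer-Moore).

    Returns (index of first match or -1, number of distinct text positions read).
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    m = len(pattern)
    # Shift for a character = distance from its last occurrence in
    # pattern[:-1] to the end of the pattern; characters absent from
    # pattern[:-1] shift by the whole pattern length m.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    inspected = set()
    pos = 0
    while pos + m <= len(text):
        j = m - 1
        while j >= 0:
            inspected.add(pos + j)
            if text[pos + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            return pos, len(inspected)
        pos += shift.get(text[pos + m - 1], m)
    return -1, len(inspected)


text = "x" * 40 + "hello"
print(kmp_first("hello", text))       # (40, 45)
print(horspool_first("hello", text))  # (40, 13)
```

On this contrived text, the right-to-left search skips five positions after each mismatch and so reads far fewer characters. Real text will give different counts, and the sketch does not reproduce the (k+5)/3 estimate from the abstract.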