DSpace at EWHA: Implementation and Performance Evaluation of a Parallel Sort-Merge Join Algorithm and a Parallel Hash Join Algorithm on the Cray T3E

Title: Implementation and Performance Evaluation of a Parallel Sort-Merge Join Algorithm and a Parallel Hash Join Algorithm on the Cray T3E

Authors: 최지혜

Issue Date: 1999

Department/Major: Graduate School, Department of Computer Science

Publisher: Graduate School, Ewha Womans University

Degree: Master

Abstract: A relational database system answers a user query by applying the operations in the query to relations. The execution time of these operations grows with both the size and the number of tuples in the relations. Among them, the join operation is used very frequently in queries and requires a much longer processing time than the other operations. Parallelizing the join is therefore essential for reducing join processing time in large database systems. Much recent work has aimed at making the join more efficient, in particular at parallelizing join algorithms efficiently on multiprocessor systems.

In this thesis, we select two widely used join algorithms, the hash-based join and the sort-merge join, implement each of them in parallel on the CRAY T3E, and compare and analyze their performance. The CRAY T3E is a parallel computer in which each node has only its own local memory, with no memory shared between nodes, while all nodes share the disks. For the parallel programming paradigm we adopted message passing, which is easy to implement and highly portable, and implemented the join algorithms with MPI. To evaluate the algorithms, we measured their execution times while varying the number of processors and the size of the data (the number of tuples in the relations), and analyzed how each factor affects performance. The results show that the parallel sort-merge join algorithm performs better with 2, 4, and 8 processors, and the parallel hash join algorithm performs better with 16 and 32 processors. The parallel sort-merge join algorithm becomes more efficient as the relations grow: by its nature, it is more efficient the more tuples each processor has to process. The parallel hash join algorithm, in contrast, can outperform the sort-merge join as the number of processors increases. This is because processors communicate more frequently in the parallel sort-merge join than in the parallel hash join, so as the number of processors increases, the gain from added parallelism is offset more strongly by the growth in communication cost. Thus, an appropriate join algorithm can be chosen according to the number of processors and the number of tuples each processor must process, that is, the memory size.
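For readers unfamiliar with the two join strategies compared in the abstract, the following is a minimal single-process sketch in Python. It is not the thesis's implementation: the thesis used MPI on the CRAY T3E, whereas here the processors are simulated as list partitions. All function names (`partition`, `local_hash_join`, `local_sort_merge_join`, `parallel_join`) are hypothetical, and hash partitioning on the join key is assumed for both algorithms purely to keep the example short. In particular, the sketch does not model the heavier interprocessor communication that the abstract attributes to the parallel sort-merge join.

```python
from collections import defaultdict


def partition(relation, key, num_procs):
    """Hash-partition tuples so equal join keys land on the same simulated processor.

    In a message-passing implementation this step would be an exchange
    of tuples between nodes; here it is just list bucketing.
    """
    parts = [[] for _ in range(num_procs)]
    for row in relation:
        parts[hash(row[key]) % num_procs].append(row)
    return parts


def local_hash_join(r_part, s_part, key):
    """Build a hash table on one partition of R, then probe it with S."""
    table = defaultdict(list)
    for r in r_part:
        table[r[key]].append(r)
    return [(r, s) for s in s_part for r in table.get(s[key], [])]


def local_sort_merge_join(r_part, s_part, key):
    """Sort both partitions on the join key, then merge matching runs."""
    r_sorted = sorted(r_part, key=lambda t: t[key])
    s_sorted = sorted(s_part, key=lambda t: t[key])
    out, i, j = [], 0, 0
    while i < len(r_sorted) and j < len(s_sorted):
        rk, sk = r_sorted[i][key], s_sorted[j][key]
        if rk < sk:
            i += 1
        elif rk > sk:
            j += 1
        else:
            # Find the end of the run of equal keys on each side.
            i_end = i
            while i_end < len(r_sorted) and r_sorted[i_end][key] == rk:
                i_end += 1
            j_end = j
            while j_end < len(s_sorted) and s_sorted[j_end][key] == rk:
                j_end += 1
            # Emit the cross product of the two runs.
            out.extend((r, s) for r in r_sorted[i:i_end] for s in s_sorted[j:j_end])
            i, j = i_end, j_end
    return out


def parallel_join(r, s, key, num_procs, local_join):
    """Partition both relations, then join each pair of partitions independently."""
    r_parts = partition(r, key, num_procs)
    s_parts = partition(s, key, num_procs)
    result = []
    for p in range(num_procs):  # each iteration stands in for one processor
        result.extend(local_join(r_parts[p], s_parts[p], key))
    return result
```

Calling `parallel_join(r, s, 0, 4, local_hash_join)` and `parallel_join(r, s, 0, 4, local_sort_merge_join)` on the same two lists of tuples (join key in position 0) should produce the same set of matched pairs, possibly in a different order. Only the per-partition work differs, which is the part whose cost depends on how many tuples each processor holds.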