DSpace at EWHA: In-Depth Study on a Functional Bloom Filter

Title: In-Depth Study on a Functional Bloom Filter

Authors: 변하영

Issue Date: 2020

Department/Major: Graduate School, Department of Electronic and Electrical Engineering

Publisher: Ewha Womans University Graduate School

Degree: Doctor

Advisors: 임혜숙

Abstract: Hash-based data structures are widely used in many applications. Hashing converts a variable-length element into a fixed-length index that locates an entry in a hash table. An intrinsic problem of hashing is collision, in which two or more elements hash to the same value. The more heavily a hash table is loaded, the more collisions occur. Elements that cannot be stored in a hash table because of collisions cause search failures. Many variant structures have been studied to reduce the number of collisions, but none completely solves the collision problem. This dissertation claims that a functional Bloom filter (FBF) provides a lower search failure rate than hash tables when the load factor of the hash table, α, is close to 1. Whereas a hash table must store each key together with its return value, an FBF stores only return values, because the distinct combination of indexes generated for each key serves to identify that key. The search failure rate of the FBF is compared theoretically with those of hash-based data structures such as a multi-hash table, a cuckoo hash table, and a d-left hash table. Simulation results confirm the theoretical analysis and compare the structures: when storing 2^17 elements at α=1, the search failure rate of the multi-hash table is 5%, ten times that of the FBF (0.5%). Hence, the FBF achieves a much lower search failure rate than the other hash-based data structures. In addition, this dissertation proposes FBF-based network algorithms for two applications: Internet protocol (IP) address lookup and name lookup in named data networking (NDN). Specifically, a vectored-Bloom filter structure is proposed for high-speed IP address lookup, and a 2-phase Bloom filter structure is proposed for high-speed name lookup in NDN.
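The key-less storage described above can be illustrated with a minimal Python sketch. The abstract does not give the cell encoding, the hash functions, or the programming procedure, so everything here is an assumption made for illustration: the class name `FunctionalBloomFilter`, the `EMPTY` and `CONFLICT` cell states, the SHA-256-based index derivation, and the rule that a cell written with two different values is marked as a conflict. It is not the dissertation's implementation.

```python
import hashlib

# Assumed cell states (hypothetical). Stored return values must be
# positive integers so they never clash with these two markers.
EMPTY, CONFLICT = 0, -1


class FunctionalBloomFilter:
    """Illustrative FBF: cells hold return values, never keys."""

    def __init__(self, num_cells, num_hashes):
        self.m = num_cells
        self.k = num_hashes
        self.cells = [EMPTY] * num_cells

    def _indexes(self, key):
        # Derive k indexes from salted SHA-256 digests (an assumption;
        # any family of independent hash functions would do).
        indexes = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            indexes.append(int.from_bytes(digest[:8], "big") % self.m)
        return indexes

    def insert(self, key, value):
        for idx in self._indexes(key):
            if self.cells[idx] == EMPTY:
                self.cells[idx] = value        # first writer stores its value
            elif self.cells[idx] != value:
                self.cells[idx] = CONFLICT     # different values collided here

    def lookup(self, key):
        seen = set()
        for idx in self._indexes(key):
            cell = self.cells[idx]
            if cell == EMPTY:
                return ("negative", None)      # key was never programmed
            if cell != CONFLICT:
                seen.add(cell)
        if not seen:
            return ("indeterminable", None)    # every cell is a conflict
        if len(seen) == 1:
            return ("positive", seen.pop())    # may be a false positive
        return ("negative", None)              # cells disagree on the value
```

In this sketch, a stored key always comes back either as `("positive", value)` with its own value or as `("indeterminable", None)`. The second outcome is the kind of search failure the abstract refers to.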
This dissertation also proposes two advanced FBF architectures, each of which provides a lower search failure rate than a single FBF of the same memory size. The first, the 2-stage functional Bloom filter structure, adds a secondary FBF that stores the elements that could not be programmed into the primary FBF because of index collisions. The secondary FBF reduces the search failures caused by indeterminable results; in other words, when a large amount of data must be processed in limited memory, two FBFs yield a lower search failure rate than a single FBF. Theoretical and simulation results show this reduction and identify the ratio of memory allocated to the two FBFs that achieves the lowest search failure rate. Experiments validate the theoretical results: the best search failure rate, 0.03~0.07%, is obtained when the secondary FBF uses 3% of the total memory, whereas the worst search failure rate, 0.6~0.8%, occurs when a single FBF occupies the entire memory. The second architecture, the learned functional Bloom filter structure, replaces the core part of an FBF with a deep-learning model. An FBF can be decomposed into a learned model and auxiliary structures that together provide the same semantic guarantees as an FBF. The learned FBF structure has two advantages. In terms of search performance, for a given memory size, the learned FBF structure has a lower search failure rate than a single FBF.
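In terms of the auxiliary structures of a learned model, adding a Bloom filter and a functional Bloom filter to the model can improve the classification accuracy when the model is used for multiclass classification, under two assumptions: the positive inputs included in the set (i.e., elements in the positive classes) are limited, whereas the negative inputs, which belong to none of the positive classes, are not. For model training, character-level neural networks (NNs) are used with pretrained embeddings. In the experiment, four different character-level NNs are trained: a single gated recurrent unit (GRU), two GRUs, a single long short-term memory (LSTM), and a single one-dimensional convolutional neural network (1D CNN). Each learned FBF structure becomes more effective as the data size increases, because the memory requirement of the model in the structure is fairly small relative to the data size. Simulation results show that the learned FBF structure reduces the search failure rate by 1.653~1.877% in the same amount of memory as a single FBF.

Abstract (translated from the Korean): Hash-based data structures are key-value data structures that return the value corresponding to an input key, and they are used in a wide range of applications. Hashing converts a variable-length element into a fixed-length index that can be used to find the matching entry in a hash table. The intrinsic problem of hashing is collision, in which two or more elements have the same hash value, and the more elements a hash table stores, the more collisions occur. Elements that cannot be stored in the hash table because of collisions cause search failures during lookup. Various data structures have therefore been studied to reduce the number of collisions, but no data structure resolves hash collisions completely. This dissertation proposes that the larger the load factor of a hash table, the more efficient a functional Bloom filter (FBF) is than the hash table in terms of search failure rate. A hash table stores each key and return value as a pair, whereas an FBF stores only return values. Even though keys are not stored in an FBF, the distinct combination of indexes derived from each key can take the place of the key. That is, this dissertation proposes that, for storing a large amount of data in a memory of limited size, the FBF is more effective than the hash-based data structures considered: the multi-hash table, the cuckoo hash table, and the d-left hash table. The search failure rate is the most important criterion for evaluating the performance of a key-value data structure. Accordingly, this dissertation theoretically analyzes and compares the search failure rate of each structure and validates the theoretical results experimentally, showing that the FBF is the most effective structure in limited memory.
The experimental results show that, for a set of 2^17 elements with the load factor of the multi-hash table equal to 1, the search failure rate of the multi-hash table (5%) is ten times that of the FBF (0.5%). The FBF therefore has a much lower search failure rate than the other hash-based data structures and is an effective structure that provides the most accurate results when the same amount of memory is used. This dissertation also proposes FBF-based network algorithms for applications such as IP address lookup and name lookup in named data networking (NDN). Specifically, a vectored-Bloom filter structure that uses one FBF is proposed for high-speed IP address lookup, and a 2-phase Bloom filter structure that uses two FBFs is proposed for high-speed name lookup in NDN. Furthermore, this dissertation proposes two improved FBF structures that provide a lower search failure rate than a single FBF of the same memory size. The first is the 2-stage functional Bloom filter structure, which builds a secondary FBF to store the elements that cannot be programmed into the primary FBF because of index collisions. Adding the secondary FBF reduces the search failures caused by indeterminable results. That is, when a large amount of data is processed in limited memory, the 2-stage FBF structure is more effective than a single FBF in terms of search failure rate. Theoretical analysis and experimental results confirm that the 2-stage FBF structure has a lower search failure rate than a single FBF and show the ratio of memory between the two FBFs that achieves the lowest search failure rate. They show that the best performance (a search failure rate of 0.03~0.07%) is obtained when the secondary FBF uses 3% of the total memory, and the worst performance (a search failure rate of 0.6~0.8%) when a single FBF uses the entire memory. The second improved structure is the learned functional Bloom filter structure, which replaces the core part of an FBF with a deep-learning model. An FBF can be decomposed into a learned model and auxiliary structures while preserving the same semantic guarantees. The learned FBF structure has advantages in two respects. In terms of search performance, for a given memory size the learned FBF structure has a lower search failure rate than a single FBF. In terms of the auxiliary structures of the learned model, adding one Bloom filter and one FBF improves the classification accuracy when the model is used for multiclass classification. The model is used under two assumptions: the positive inputs included in the set (inputs belonging to one of the positive classes) are limited, and the negative inputs that belong to no positive class are not limited. For model training, character-level neural networks (NNs) are used with pretrained embeddings. Four kinds of models are trained in the experiments: a model with one gated recurrent unit (GRU) layer, a model with two GRU layers, a model with one long short-term memory (LSTM) layer, and a model with one one-dimensional convolutional neural network (1D CNN). Because the memory requirement of the model is fairly small relative to the data size, the learned FBF structure becomes more effective as the data size increases. The experimental results show that the learned FBF structure has a search failure rate 1.653~1.877% lower than that of a single FBF with the same memory.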
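The 2-stage structure described in the abstract can be sketched in Python as follows. Only two points come from the abstract: elements that end up indeterminable in the primary FBF are stored in a secondary FBF, and 3% of the total memory is the secondary share reported as best. Everything else is an assumption made for illustration: the names `FBF`, `TwoStageFBF`, `secondary_fraction`, and `search_failure_rate`, the cell states, the salted SHA-256 hashing, and the two-pass build. This is a sketch, not the dissertation's implementation.

```python
import hashlib

EMPTY, CONFLICT = 0, -1  # assumed cell states; values must be positive ints


class FBF:
    """Compact illustrative functional Bloom filter (cells hold values only)."""

    def __init__(self, num_cells, num_hashes, salt):
        self.m, self.k, self.salt = num_cells, num_hashes, salt
        self.cells = [EMPTY] * num_cells

    def _indexes(self, key):
        # Salted SHA-256 gives each filter its own index combinations.
        return [
            int.from_bytes(
                hashlib.sha256(f"{self.salt}:{i}:{key}".encode()).digest()[:8],
                "big",
            ) % self.m
            for i in range(self.k)
        ]

    def insert(self, key, value):
        for idx in self._indexes(key):
            if self.cells[idx] == EMPTY:
                self.cells[idx] = value
            elif self.cells[idx] != value:
                self.cells[idx] = CONFLICT

    def lookup(self, key):
        cells = [self.cells[idx] for idx in self._indexes(key)]
        if EMPTY in cells:
            return ("negative", None)
        values = {c for c in cells if c != CONFLICT}
        if not values:
            return ("indeterminable", None)
        if len(values) == 1:
            return ("positive", values.pop())
        return ("negative", None)


class TwoStageFBF:
    def __init__(self, total_cells, num_hashes, secondary_fraction=0.03):
        # Split one memory budget between the two filters.
        secondary_cells = max(1, round(total_cells * secondary_fraction))
        self.primary = FBF(total_cells - secondary_cells, num_hashes, "primary")
        self.secondary = FBF(secondary_cells, num_hashes, "secondary")

    def build(self, items):
        for key, value in items.items():
            self.primary.insert(key, value)
        # Second pass, once the primary is final: keys it cannot resolve
        # are programmed into the secondary filter.
        for key, value in items.items():
            if self.primary.lookup(key)[0] == "indeterminable":
                self.secondary.insert(key, value)

    def lookup(self, key):
        result = self.primary.lookup(key)
        if result[0] == "indeterminable":
            return self.secondary.lookup(key)
        return result


def search_failure_rate(structure, items):
    """Fraction of stored keys whose lookup is indeterminable."""
    failures = sum(structure.lookup(k)[0] == "indeterminable" for k in items)
    return failures / len(items)
```

In this sketch, the secondary filter is consulted only when the primary returns an indeterminable result. The 2-stage structure therefore never fails on a stored key that its own primary resolves. Whether it beats a single FBF given the whole memory budget depends on the split, which is the trade-off the abstract reports results for.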