# Probabilistic Data Structures to Improve System Performance

## Abstract

The purpose of this paper is to introduce two data structures which are very efficient in both time complexity and storage space requirements: Bloom Filters and HyperLogLog.

The drawback of these structures is that they do not provide an exact answer in all cases, instead, they provide answers with a certain margin of error. That margin can be made as small as desired, but not without a trade-off between the space requirements and the error rate. Therefore, programmers should choose error rates that best suit their needs without using too much space.

## Contents

1. Bloom Filters
1. Introduction
2. The Classical Approach vs. Bloom Filters
3. How do Bloom Filters Work?
4. Practical Considerations
5. Applications
6. Variants of the Structure
7. Conclusions
2. HyperLogLog
1. Introduction
2. How does a HyperLogLog Work?
3. Practical Considerations
4. Applications
5. Conclusions

## Bloom Filters

### Introduction

A Bloom Filter is a data structure that can inform whether an element belongs to a set or not. Due to the probabilistic nature of the structure it is not possible to rely entirely on the answer. However, there are cases when the answer is completely accurate: there are not any false negatives as a result of a membership query. That is to say, if an element is not in the set, the answer will be that the element does not belong to the set. On the other hand, false positives are possible. If the element is in the set, the answer is “possibly in the set”, with a margin of error. This structure supports the following operations: insertion of an element, membership query, union and intersection between sets. It is not possible, at least in a standard way, to remove elements from the set.

### The Classical Approach vs. Bloom Filters

A classical approach for querying whether an element belongs to a set is to have the elements themselves represented in a data structure, like Arrays, Linked Lists, Hash Maps, Binary Trees, etc. These methods have different time complexity: O(1) for Hash Map, O(Ln(n)) for Binary Tree, O(n) for Linked Lists. But all of these methods have a space complexity of O(n). They all need to insert the element itself in the structure in order to determine if the element is present in the set. In contrast, the Bloom Filter does not store each element, it uses an array of M bits, where M is constant in the Bloom Filter’s life. Therefore, it is possible to account for all the elements in the set with only a few bits per element.

### How do Bloom Filters Work?

How do Bloom Filters internally work? There are K different hash functions, each one with a different output for the same element. When an element needs to be inserted in the set, the outputs for the different hashes are computed, and for each result, we set in 1 the corresponding position in a bit array modulus M, where M is the size of the bit array. When we need to query the existence of an element inside the set, we compute the output of the different hashes, and check against the bit array. If we found that a position in the array is 0, then the element would definitely not be in the set. On the contrary, if all positions indicated by the hashes outputs are 1, then we cannot be certain that the element is present in the set because those positions could be marked by other inserted elements. A consequence of this is that if all positions of the bit array are marked with 1, then all queries to the Bloom Filter will return “possibly in the set” as an answer, so there is a need to strike a balance between the array length and the number of different elements that are assumed to be inserted. Figure 1 illustrates an example of the bit array containing 3 elements. Figure 1 – Three elements (x, y and z) are inserted in the Bloom Filter. There are three different hash functions that map each element to a position in the array. The filter is being queried to check if a fourth element (w) is in the set. Because one of its positions is marked with 0, then the element is considered out of the set.

Set unions and intersections can be achieved efficiently if two Bloom Filters share the same hash functions, K and M values. The union operation is done by simply performing an OR operation between the elements in each bit array, whereas the intersection operation is done by performing an AND operation between the elements. A Bloom Filter created by the union of two Bloom Filters is the equivalent of a Bloom Filter created from scratch with the union of the sets. In contrast, the Bloom Filter created by the intersection has an error margin greater than that if the Bloom Filter was created by inserting the elements of the intersected sets. The great advantage is, of course, that the operations are trivial to implement and very efficient.

### Practical Considerations

The probability of a false positive (p) can be approximated by an equation involving the number of hash functions (K), the size of the bit array (M) and the number of elements inserted in the set (N). The approximation is given by: With that in mind, for a certain desired margin of error, we need to have an idea of the approximate number of different elements that are going to be inserted and then create the Bloom Filter with all the parameters according to the error margin equation.

Hash functions may have the property of being uniformly distributed over the entire range, and each hash function must produce different results for the same element. Getting a large amount, or a dynamic amount of hash functions may be difficult. There are techniques to produce K different hash functions from only one base hash function. Care must be taken in order to avoid cycling or repeating the output between hashes using these techniques. One approach can be to iteratively apply the base hash function to obtain the different outputs. That is to say, if you have a Hash function H(x), then: The problem of this approach is that it can generate a cycling pattern. Another method could be to modify the original element in order to generate different outputs with the same hash function. For example, using a concatenation operator or else a summation operator: Now, with this method the problem is that consecutive elements that are going to be inserted will collide in the hash, like H1(2) = H2(1). It is possible to mix these two techniques too, but whatever the method chosen to generate the K different Hash Functions, the previously commented problems should be avoided.

The number of different hash functions (K) must be chosen carefully for minimizing the margin of error. For given N and M values, the following equation gives us the optimal number of hash functions needed: Moreover, the size of the bit array (M) can be computed to ensure that the false positive error will not be bigger than a certain value p given the expected number of different elements to be inserted (N). This number is given by the equation: Once a Bloom Filter with the K, M and p parameters is created, the margin of error will certainly be less than p if and only if the number of distinct elements inserted is less than N. Once the value is reached, further elements can be inserted in the set, the Bloom Filter will not impose any restrictions on that, but the false positive error will rise and, eventually, if all bits are marked as 1, then all queries will return “possibly in the set” as a result, which is not useful.

It can be demonstrated that the space requirement is: where p is the margin of error. Reducing the margin in a factor of 10 only increases the space requirements by a factor of 2. Hence the difference in having a margin of error of 0.01 compared to 0.1 (10x) is translated into having only 2x more bits. Figure 3 shows a chart with an example. Figure 3 – Chart showing different error margins with their respective sizes in bits per key and MB per 1 million elements for an optimal Bloom Filter configuration.

An important disadvantage is that it isn’t possible to know the exact number of elements in a set. It is possible, however, to approximate this number using the following formula where X is the number of bits set to 1: ### Applications

The probabilistic nature of Bloom Filters makes it inappropriate for some specific possible scenarios. When you need the exact answer for the question if an element has not previously been seen, then a Bloom Filter can be a good choice. A possible application is to use it as a previous step before an expensive search for an element that is not known to exist or not. When the answer is “NO” there is no need to waste time in an expensive search. When the answer is “MAYBE” then you can perform the expensive search in the hope that the element exists. A low error margin will ensure minimizing the expensive searches that do not return anything.

A generalization of the previous approach can be to have an array of Bloom Filters, one Bloom Filter for each node that is supposed to contain information. For example, in a distributed database with N nodes, each node having a subset of the total information, when there is a need to look for an element, it is better to check against Bloom Filters in order to skip the nodes that do not have the required information.

Possible uses include dictionary lookup, database access, network distribute caching, resource routing, geographic routing, etc. A non-exhaustive list where Bloom Filters are used is: Google BigTable, Google Chrome, Apache HBase, Apache Cassandra, Bitcoin, Squid Web Proxy.

### Variants of the Structure

Classic Bloom Filters have been improved in different ways to overcome their limitations. One extension is the Counting Filter, which allows elements in the set to be deleted. In a Counting Filter each bucket contains n bits (usually 3 or 4), as opposed to the single bit in the Bloom Filter. The insert operation must be modified to increase the counter value, and the delete operation can be implemented by decreasing the value in each bucket. Care must be taken with the overflow and underflow of the bucket.

Another improvement is the Bloomier Filter, in which an element can be associated with a key, implementing a map. The false positive is translated as returning a wrong element in a map.

Scalable Bloom Filters are Bloom Filters that can grow over time, removing the constraint that is required to know the number of elements that are going to be inserted. They are implemented by using standard Bloom Filters chained one behind the other with increasing capacity.

Compressed Bloom Filters are an improvement over the size of the filter at the expense of using more computational resources. At early stages of the standard Bloom Filter, when there are few elements inserted, the bits set to 1 are sparse through the array. This is inefficient in terms of storage, so if you apply a compression algorithm to the bit array, like Run-Length encoding, then the array stream will be shorter, thus improving the size of the structure. There is a point when the filter gets overloaded and the compression algorithm will not be useful, it will instead be counter-productive as it will increase the size of the compressed stream. Such cases must be avoided in order to be space-efficient.

### Conclusions

To conclude, Bloom Filters are a good choice when there is a need to efficiently query the existence of an element in a set and the storage available is limited, though care must be taken with its drawbacks, specially with the false positive error margin. Their fast access time and little storage requirements make them ideal to use as a cheap form of system improvement.

## HyperLogLog

### Introduction

The HyperLogLog Data Structure is used to maintain the cardinality of a set with a requirement in the order of Logarithm of the Logarithm in terms of space storage.

It is only an approximation of the cardinality and not the cardinality itself, but the error in the answer can be made as little as desired. It is counter-intuitive that the space requirement is in the order of O(Ln(Ln(n))) where n is the count of inserted elements, since traditional approaches that provide an exact answer require O(n) in terms of storage space, which can be prohibitive if the number of elements is huge. But due to the fact that the structure does not store the elements themselves and because the answer is not exact, this structure is very efficient and can handle very big numbers, in the order of 2^64 with little memory requirements.

### How does a HyperLogLog Work?

The way a HyperLogLog works is based on the fact that in a set of N uniformly distributed integers, the cardinality of the set can be roughly estimated by considering n as the number of consecutive zeros counting from the least significant bit to the most significant bit for each number in the set. Then the cardinality estimation is 2^n. As an example, assume that the set consist of only 8-bit integers, where X represents any binary digit (0 or 1), then in binary representation we have: Of course this estimation is too rudimentary and has a great variance in the results, so it is necessary to implement variance reduction techniques in order to improve the accuracy of the answer. In the HyperLogLog this is accomplished by dividing the set into m different disjoint subsets, then estimating the cardinality for each subset and applying a mean on the estimations. The harmonic mean is used because it provides a better approximation than others like the arithmetic or geometric mean, and it is a good method for variance reduction. The harmonic mean of a set of numbers is calculated as follows: Using the harmonic mean, the final cardinality estimation is given by the formula: Where a is a correction factor that depends on the m value and is obtained empirically to minimize the error in the estimation.

A way of dividing the original set into m disjoint subsets is to split the binary representation of the number into two parts. The most significant part can be considered as an index of the subset registry, while the least significant part is the number to be inserted in the subset. This implies that m must be a power of two.
For example, if m = 256, we store 32 bit integers, and we need to insert a number with the format xxxxxxxxyyyyyyyyyyyyyyyyyyyyyyyy, then the number yyyyyyyyyyyyyyyyyyyyyyyy will be processed in the registry for the bucket xxxxxxxx. Figure 4 shows an example of this method. Figure 4 – The binary representation of the number is split. The red color indicates the registry to be updated. The green color indicates the number of rightmost zeros. The value of registry 9 is then updated with the number 6 only if its previous value was smaller.

Since the elements need to be uniformly distributed integers, not all values can be used, but this difficulty can be easily overcome by applying a good hash function with those properties over the elements. The choice of the hash function is crucial for the working of the algorithm. The desired properties of hash functions are:

• output of fixed size (typical 32 bits, but preferable 64 bits)
• collision resistance (hard to find two different inputs with the same output)
• cascade effect (small changes in input produce big changes in output)
• uniform distribution (for a random input, every bit in the output has 0.5 probability to be 1 or 0)

### Practical Considerations

As in the case of Bloom Filters, the HyperLogLog structure is not suitable if we need to know the exact number of different elements. Instead, it can only answer with an approximation. The error can be as minimal as desired but at the expense of using more storage space. It can be shown that the error is approximately given by the following formula where m is the number of subset registry used: To count up to 2^64 elements, only 6 bits per registry are necessary because but, with only one registry of 6 bits the error in the estimation will be huge. It is necessary to add more registers in order to improve the accuracy of the estimation. Figure 5 shows a chart comparing the number of registers used, the approximate error and occupied size. Figure 5- Chart showing the relationship between the number of registers, the error rate and the structure size in KB.

Note the last row, where a little more than a million registers are used, the error in the estimation is only about 0.1% and the size of the structure is only 768 KB. With this structure it is possible to count up to 10^19 distinct elements.

One special consideration is that the cardinality estimation tends to be weak when the number of elements is only a few, in the order of m*2.5. In such cases it is better to apply a correction to the estimation, or directly switch to another more accurate method for estimating the cardinality and then when the number of elements is bigger, switch to the HyperLogLog.

On the other extreme, when the number of elements is big enough, in the order of 10^9, the estimation error will tend to be larger and larger, and a correction factor is the only way to get correct results.

The union of two sets may be easily computed by taking the maximum value of each subset register, and the estimated value created that way is exactly the same as if it were created from scratch with the union of the two sets.

Intersection, on the other hand, is not directly possible, that is to say, it isn’t possible to construct a HyperLogLog that represents the intersection based only on the HyperLogLog of the sets. However, the cardinality of the set can still be calculated using the union of the two sets with the inclusion-exclusion principle: When there are few elements and the data stored in the subset registers are sparse, a compression algorithm can be applied to all these registers, for example the Run-Length encoding or another compression technique, thus improving the storage capacity of the HyperLogLog even more. However, when the entropy of the stream increases, the efficiency of the compression algorithms is completely lost and then becomes useless as it increases the size of the stream.

### Applications

Possible applications include: estimating the selectability for tables when planning a query in a RDBMS, or knowing approximately the count of unique visitors in a website based on the IP address. In machine learning it can be used to train an algorithm like Naive Bayes or another one which is not harmed by errors and noise data in the training set. Big Data applications can use this structure too as it handles a large amount of data with ease. This structure is useful when the number of elements is estimated to be very big, in the order of billions, the space requirements are critical and an approximate answer is good enough.

## Conclusions

Probabilistic Data Structures are very powerful and useful structures for speeding up a system. However, they require fine tuning and expertise to make proper use of them. Bloom Filters can determine if an element belongs to a set, whereas HyperLogLog can be used to approximate the cardinality of the set. Used together they provide an alternative representation of sets, in contrast with conventional representations which can be impractical in some scenarios.

• James Blustein – Amal El-Maazawi. Bloom Filters: A Tutorial, Analysis, and Survey.
(https://www.cs.dal.ca/sites/default/files/technical_reports/CS-2002-10.pdf)
• Andrei Broder – Michael Mitzenmacher. Network Applications of Bloom Filters: A Survey.
(http://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf)
• HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm (Philippe Flajolet – Éric Fusy – Olivier Gandouet – Frédéric Meunier)
http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
• Stefan Heule – Marc Nunkesser – Alexander Hall. HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.