This post provides a high level view of the learned Bloom filter, a learning-augmented algorithm which uses a machine learning-based oracle to improve the performance of a standard Bloom filter. The content of this post is derived from my scribe notes for Lecture 4 of 6.890 Learning-Augmented Algorithms taught by guest lecturer Mike Mitzenmacher.

A Bloom filter is a data structure which approximately solves the set membership problem. The set membership problem asks us to determine whether an element $x$ belongs to a set $S$. Using a Bloom filter, we can determine set membership without having to store the entire set while providing probabilistic performance guarantees.

Standard Bloom Filters

The standard Bloom filter is composed of the following components:

  • An array of $m$ bits which represents a set of elements.
  • A set of $k$ hashing functions, each of which maps an element to a value in {$1, 2, ... m$}.

To initialize the Bloom filter, we hash each element in the set using every hash function and set the corresponding bits in the array to $1$. Then, we can check membership by hashing the query element and determining whether all the corresponding bits are set to $1$. A Bloom filter cannot have false negatives since any element that is in the set will have all corresponding bits set to 1 but it can have false positives since the corresponding bits for an element could coincidentally be set to 1.

Consider the following example where we have three hash functinos ($m=3$) represented by red, green, and blue arrows. We can initialize the Bloom filter from our set {$A, B, C$} by setting the bits indicated by the hash function to $1$.

Then, to check whether element $Z$ is in the set, we can simply take the three hash values and determine whether the corresponding bits are all set to 1.

Analysis

Given a set of $n$ elements, a $m$-bit Bloom filter, and $k$ hashing functions, we know that the probability of a specific bit in the Bloom filter being 0 is given by
\begin{equation}
p = (1 - 1/m)^{kn} \approx e^{-kn/m}
\end{equation}
where the approximation follows from the fact that $(1 - 1/m)^m \rightarrow 1/e$ as $m \rightarrow \infty$. Therefore, it follows that the false positive probability is approximately
\begin{equation}
(1-e^{-kn/m})^k
\end{equation}

Knowing this, we can then solve for the number of hash functions $k$ which minimizes the false positive rate and find that
\begin{equation}
k^* = ln(2) \cdot \frac{m}{n}
\end{equation}
which results in a minmized false positive rate of approximately $0.6185^{m/n}$.

In general, most Bloom filter-like data structures (e.g. Cuckoo filters) produce a false positive rate which can be expressed in the form $a^{m/n}$ where $m/n$ is the number of bits in our array per element in the set. We will focus specifically on the Bloom filter case where $a=0.6185$ in our analysis.

Learned Bloom Filters

In [1], researchers from Google Brain suggest that we can do better than standard Bloom filters by learning the set. The key idea is to train an oracle which provides the probability that an element is in the set as a binary classification problem, using elements in the set as positive examples, and randomly or selectively generating negative examples. We could then use the oracle to filter out elements that are extremely likely to be in the set and fall back on a standard Bloom filter to catch any false negatives that were not identified by the oracle, as shown in following figure.

This allows us to maintain our strong guarantee of no false negatives. Even if our learned oracle has a high (e.g. 0.5) false negative rate, our backup Bloom filter would only have to store at most $FN \cdot n = 0.5n$ elements, potentially yielding a space savings of 50% in our Bloom filter (not including the Oracle), a traditionally high-memory data structure. Intuitively, as long as there is structure in the data, the learned Bloom filter should be able to exploit that structure to grow sub-linearly (rather than linearly, as standard Bloom filters do), thus eventually yielding better space requirements. We compare the performance of Bloom filters and learned Bloom filters more rigorously in the following sections.

To see why machine learning can help with set membership problems, consider an extreme example. Suppose your set is a collection of closed intervals in the universe of 32-bit integers. To handle this type of set, you can create a "Bloom filter" where each interval is represented by the start and end point and you can check membership simply by determining whether the value is in one of these intervals. This data structure is smaller and more accurate than the standard Bloom filter because it takes advantage of the structure of the data. Most data has some sort of structure - this is the property we will aim to exploit using machine learning.

False Positives vs "False Positives"

Before proceeding to analyze the false positive rate of learned Bloom filters, we need to define the false positive rate for the oracle and the backup Bloom filter.

Bloom filters. In standard Bloom filters, the false positive probability is defined as the probability that any non-set item yields a false positive. This corresponds to the false positive rate for any collection of non-set items, regardless of their distribution.

Learned bloom filters. The false positive probability of the learned bloom filter is an empirical estimate assuming that the future inputs are drawn from the same distribution as the data used to train the oracle. The assumption is unrealistic for many types of applications. Applications using bloom filters often do not have negative examples in their ``training sets,'' and even if they do, there is no guarantee future negatives in production will look the same as negative examples from the training set. One proposed solution to this problem is to simply retrain the oracle when the false positive rate is too high, as this indicates the actual distribution differs from the training distribution. However, further work in this area is needed to better understand these trade-offs.

To see why this distinction matters, consider the following example:

  • Suppose our universe is $[1, 1000000]$.
  • The set contains 500 random values from $[1000, 2000]$.
  • A learned oracle could, with high probability, accept all values in $[1000, 2000]$.
  • Then, the false positive probability would depend heavily on the query distribution. If the queries are uniform over the whole universe, then the false positive probability would be extremely low. If the queries are uniform over $[1, 10000]$, then the false positive probability would be significantly higher.
  • In contrast, a standard Bloom filter would have a similar false positive probability regardless of the query distribution.
    \end{enumerate}

This example demonstrates that the false positive probability of the learned bloom filter is data-dependent, unlike the false positive probability of the standard Bloom filter.

Analysis

Given a learned oracle of size $x$ with false positive rate $F_P$ and false negative rate $F_N$ as well as a backup bloom filter of size $c \cdot n$ where $n$ is the number of elements in the set, the false positive rate of the learned Bloom filter is given by
\begin{equation}
F_P + (1 - F_P) a^{c/F_N}
\end{equation}
which corresponds to the probability of the oracle producing a false positive plus the probability of the oracle rejecting an element which triggers a false positive in the backup Bloom filter. For comparison, a standard Bloom filter which uses the same number of bits, $x + c \cdot n$, offers a false positive rate of
\begin{equation}
a^{c+x/n}
\end{equation}
which tell us that our learned Bloom filter offers better performance than the standard Bloom filter when
\begin{equation}
a^{c+x/n} > F_P + (1 - F_P) a^{c/F_N}
\end{equation}
is true. Intuitively, this corresponds to scenarios where we have a large number of elements in the set and our model has a low false negative rate.

Sandwiched Learned Bloom Filters

In order to further improve the performance of our learned Bloom filter, [2] introduces an additional standard Bloom filter before the learned oracle to catch false positives preemptively. This effectively ``sandwiches" the learned oracle between two Bloom filters as shown in the following figure.

Intuitively, adding an additional bloom filter allows more control in balancing the size/false positive trade-off inherent in all bloom filters. The authors report that this setup produces a 2-10x improvement in the false positive rate for several standard configurations.

Consider a sandwiched learned Bloom filter with $c_1 \cdot n$ bits in the initial filter, $c_2 \cdot n$ bits in the final filter, and $x$ bits in the learned oracle. The false positive rate for this setup is given by
\begin{equation}
a^{c_1} (F_P + (1 - F_P) a^{c_2/F_N})
\end{equation}
which corresponds to the probability of the first Bloom filter producing a false positive multiplied by the probability of the learned oracle and the final Bloom filter producing a false positive as well.

If we solve the first order conditions for the value of $c_2$ that minimizes the false positive rate while holding $c_1$, $F_P$, and $F_N$ constant, we find that
\begin{equation}
c_2^* = F_N log_{a} \frac{F_P}{(1-F_P) (\frac{1}{F_N}-1)}
\end{equation}
which suggests that the final Bloom filter should have a fixed size which does not depend on the number of elements in the set or the size of the initial Bloom filter.

Finally, we note that these ideas can also be applied to varations on Bloom filters such as Bloomier filters which not only solve the approximate set membership problem but also associate a static value with each element in the set. More details can be found in [3].

References

[1] Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, Neoklis Polyzotis. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD), pages 489--504. ACM, 2018.

[2] Michael Mitzenmacher. Optimizing Learned Bloom Filters by Sandwiching. 2018.

[3] Michael Mitzenmacher. A model for learned bloom filters and optimizing by sandwiching. Advances in Neural Information Processing Systems, pages 462--471, 2018.