Ribbon Filter: Practically Smaller Than Bloom and Xor: Peter C. Dillinger Stefan Walzer
Ribbon Filter: Practically Smaller Than Bloom and Xor: Peter C. Dillinger Stefan Walzer
ABSTRACT themselves2 . Thus, the Bloom filter filters out almost all specific
Filter data structures over-approximate a set of hashable keys, i.e. set key queries3 to data files that would find no relevant data. False
membership queries may incorrectly come out positive. A filter with negative (FN) queries would be incorrect for this application and
false positive rate 𝑓 ∈ (0, 1] is known to require ≥ log2 (1/𝑓 ) bits must never occur.
Blocked Bloom filters [44, 54] are a popular Bloom variant be-
arXiv:2103.02515v2 [[Link]] 8 Mar 2021
per key. At least for larger 𝑓 ≥ 2−4 , existing practical filters require
a space overhead of at least 20% with respect to this information- cause they are extremely fast. We do not expect to improve upon
theoretic bound. this solution4 for short-lived applications such as database joins
We introduce the Ribbon filter: a new filter for static sets with or the smallest levels of an LSM-tree. However, Bloom filters use
a broad range of configurable space overheads and false positive at least 44% more space (“space overhead”) than the information-
rates with competitive speed over that range, especially for larger theoretic lower bound of 𝜆 = log2 (1/𝑓 ) bits per key for a hashed
𝑓 ≥ 2−7 . In many cases, Ribbon is faster than existing filters for filter with FP rate 𝑓 [13, Section 2.2]. Blocked Bloom can exceed
the same space overhead, or can achieve space overhead below 10% 50% space overhead for small 𝑓 .
with some additional CPU time. An experimental Ribbon design In this work we focus on saving space in static filters, and opti-
with load balancing can even achieve space overheads below 1%. mizing the CPU time required for saving space. Our presentation
A Ribbon filter resembles an Xor filter modified to maximize and validation are kept general, but an intended application is opti-
locality and is constructed by solving a band-like linear system mizing accesses to the largest levels of an LSM-tree, where it should
over Boolean variables. In previous work, Dietzfelbinger and Walzer be worth CPU time to save space in long-lived memory-resident
describe this linear system and an efficient Gaussian solver. We structures5 . In [18] it is shown that a relatively high FP rate for
present and analyze a faster, more adaptable solving process we call these levels is best for overall efficiency. However, the space sav-
“Rapid Incremental Boolean Banding ON the fly,” which resembles ings offered by existing practical Bloom filter alternatives is limited
hash table construction. We also present and analyze an attractive (dotted line in Figure 1 (a)), especially for higher FP rates, 𝜆 ≤ 5.
Ribbon variant based on making the linear system homogeneous, Bloom filter and alternatives. We can categorize hashed filters
and describe several more practical enhancements. by the logical structure of a query:
OR probing: Cuckoo filters [11, 32, 34, 35], Quotient filters [5,
15, 30, 51], and most others return true for a query if any
one of several probed locations is a hashed data match, much
1 INTRODUCTION like hash table look-ups. This design is great for supporting
dynamic add and delete, but all known instances use (1 +
Background and motivation. The primary motivation for this 𝜀)𝜆 + 𝜇 bits per key where 𝜇 > 1.44, or often 𝜇 = 3 for speed6 .
work is optimizing data retrieval, especially in systems aggregating Even with 𝜀 ≈ 0.05, the space overhead is large for small 𝜆.
immutable data resources. In the example of LSM-tree storage [50], AND probing: A Bloom filter query returns true iff all probed
persisted key-value data is primarily split among immutable data locations (bits) are a match (set to 1).
files. A crucial strategy for reducing I/O in key look-ups is filtering
accesses using an in-memory data structure. In a common configu- 2 Learned filters [55] or tries [59] can take advantage of regularities in the key set. A
ration, each data file has an associated Bloom filter [9] representing space-efficient hash table [15] can take advantage of a densely covered key space. We
focus on the general case.
the set of keys with associated data in that file. The Bloom filter has 3 Static filters can also support range queries, either through prefix Bloom [48] or more
some false positive (FP) rate, which is the probability that querying sophisticated schemes [47].
a key not added returns true (positive). For example, configuring 4 In Section 7 we mention and use another blocked Bloom implementation with new
the Bloom filter to use 10 bits of space per added key yields an FP trade-offs [24].
5 In some large-scale applications using RocksDB [31] for LSM-tree storage, we observe
rate of just under 1%, regardless of the size or structure of the keys roughly 10% of memory and roughly 1% of CPU used in blocked Bloom filters. The
size-weighted average age of a live filter is about three days. An experimental Ribbon
filter option was added to RocksDB in version 6.15.0 (November 2020).
This work is licensed under the Creative Commons BY-NC-ND 4.0 International 6 𝜇 > 1.44 can be explained by these structures approximating a minimal perfect
License. Visit [Link] to view a copy of hash [36], a near-strict ordering of keys based on the location of their matching entry
this license. Copyright is held by the owner/author(s). in the structure, on top of 𝜆 payload bits per key. This is an observation about existing
1 Title inspired by [34, 35, 39]. structures, not necessarily a fundamental limitation.
1
Peter C. Dillinger and Stefan Walzer
XOR probing: Xor filters [10, 14, 20, 37, 39] return true for a 𝑓 fastest filter
query iff the bitwise exclusive-or (XOR) of all probed loca-
2−16
tions is a hashed data match. XOR probing is only known to
Bloom
Xor+
Xor
work with static filters. 2−14
Cuckoo
Structures using XOR probing are the most promising for space
2−12
efficient static filters. They are constructed by solving a linear sys-
tem with one constraint per key ensuring that querying the key 2−10
returns true. Standard Xor filters use a fast solving process called Balanced
Ribbon
Blocked Bloom
peeling that limits their space efficiency to ≥ 1.22𝜆 bits per key7 , 2−8
though a variant with fast compression, the Xor+ filter [39], uses
roughly 1.08𝜆 + 0.5 bits per key, which is an improvement for larger 2−6
𝜆. Using structured Gaussian elimination instead of peeling [21, 37] 2−4 Homogeneous
offers better space efficiency, but construction times are considered
impractical for many applications. 2−2
Ribbon
%space
Core contribution. We introduce a faster, simplified, and more overhead
100 50 30 16 8 4 2 1
adaptable Gaussian elimination algorithm (Ribbon) for the static
𝑓 query times 𝑓 construction times
function data structure from [22]. Based on Ribbon, we develop a 2−16 2−16
100 ns 300 ns
family of practical and highly space-efficient XOR-probed filters. 2−14 70 ns 2−14 150 ns
2−12 50 ns 2−12 80 ns
Results and comparison. Figure 1 (a) summarizes extensive bench- 2−10 40 ns 2−10
30 ns 40 ns
marking data by indicating which structure is fastest for satisfying 2−8
20 ns
2−8 20 ns
various space and FP rate requirements for a static filter. For “fastest” 2−6 15 ns 2−6 10 ns
2−4 10 ns 2−4 5 ns
we consider the sum of the construction time per key and three 2−2 % space 2−2 % space
overhead overhead
query times (measured for 𝑥 ∈ 𝑆, 𝑥 ∉ 𝑆 and a mixed data set).8 100 50 30 16 8 4 2 1 100 50 30 16 8 4 2 1
Although we compare Ribbon with many approaches imple-
mented in the fastfilter benchmark library [38], only variants of Figure 1: (a) Fastest filter with the given combination of
Bloom, Cuckoo, Xor, and Ribbon emerge as winners. Specifically, space overhead and false positive rate, considering a mix of
the color at point (𝑥, 𝑦) indicates the fastest filter with space over- construction and query times. (b) Construction and query
head at most 𝑥 and FP rate 𝑓 ∈ [𝑦/2, 𝑦 ·2] for 𝑛 = 107 keys. Diagonal times for fastest approaches in (a).
shading indicates different winners for 𝑛 = 106 and 𝑛 = 108 . The
timings for the winning approach are also shown in Figure 1 (b).
We observe the following.
small Ribbon structures (“smash”). These features go into the
• Ribbon wins to the right of the dotted line because none of
Standard Ribbon filter, which in practice has increasing space
the competing approaches achieve space overhead this low.
overhead or running time as the number of keys increases.9
• Ribbon wins in some territory previously occupied by Xor
Section 4. We present the Homogeneous Ribbon filter, which
and Xor+ filters, mostly for 𝑓 > 2−8 (𝜆 < 8) from Xor and
shares many desirable properties with blocked Bloom filters:
𝑓 > 2−12 (𝜆 < 12) for Xor+. In these cases, Ribbon-style
construction success is guaranteed, and scaling to any num-
Gaussian elimination is faster than peeling.
ber of keys is efficient. Homogeneous Ribbon does not build
• Blocked Bloom filters are still the fastest whenever appli-
on static functions in the standard way, which simplifies
cable, though Cuckoo and Xor take some of the remaining
implementation but complicates analysis.
>44% territory (and nearby) for small FP rates.
Section 5. We describe some practical enhancements and is-
Outline. The paper is structured as follows. sues for Ribbon filters, including (1) efficiently utilizing any
Section 2. We briefly review data structures for static func- amount of memory for any number of keys, (2) laying out
tions and how they give rise to filters. data for efficient queries, (3) efficiently satisfying hashing
Section 3. We describe and analyze the new Ribbon construc- requirements, and (4) scaling Standard Ribbon with data
tion algorithm, which preserves asymptotic guarantees from sharding.
[22]. We also show how to improve the space efficiency of Section 6. We present Balanced Ribbon, an experimental ex-
tension of Standard Ribbon that uses a greedy load balancing
7A new spatially-coupled construction for Xor filters [57] promises lower space over- scheme within a contiguous ribbon. This scales the extreme
heads and in some cases slightly improved construction time with peeling. Simulations space efficiency of small Standard Ribbon filters to very large
indicate a number of keys on the order of 106 or more is needed for most of the benefit,
limiting the known generality of the approach. Configuration in practice is not well 𝑛, such as only 1.005𝜆 + 0.008 bits per key with practical con-
understood. struction and query times.
8 The “right” weighing of construction and query time clearly depends on the use case.
Section 7. We present more experimental validation.
Because LSM-trees are especially useful for write-heavy workloads requiring good
read latency, we find this a reasonable ratio for at least that use case. If a 4KB filter
memory page has a lifetime as long as two weeks and at least one negative (useful)
query per added key (𝑛 roughly 212 ) is seen, that satisfies the current rule of five 9 If
the processor word size is assumed to be Ω (log 𝑛) , query time and space overhead
minutes for caching SSD storage in RAM [2]. can be kept constant. We make no such assumption here.
2
Ribbon filter: practically smaller than Bloom and Xor
√
no matter in what order the keys of 𝑆 are inserted, we always For a smash value of ℓ = 𝑤/2 and a ribbon width of 𝑤 = 𝜔 ( 𝑛)
observe the same sets 𝑃 ∩ {1, . . . 𝑗 } and hence the same set 𝑃 the matrix diagonal is firmly within the ribbon, see Figure 4 (b). It
and the same value 𝐷. The number of row additions of sgauss is not hard to prove that such a matrix is asymptotically as likely
is bounded by 𝐷 by a similar argument (see [22, Lemma 3]). to be regular as a fully random 𝑛 × 𝑛 matrix. That probability is
The analysis in [22] proceeds by bounding 𝐷 in expectation 𝑐 2 ≈ 0.289; see [16].
and hence carries over to our case. □ In Section 5.4 we present empirical findings showing that non-
zero smash values also benefit success probabilities in practically
Efficiency. While sgauss and Ribbon are tied in O-notation, Ribbon log 𝑛
more relevant cases with 𝜀 > 0 and 𝑤 = O ( 𝜀 ).
improves upon sgauss in constant factors for the following reasons:
• There is no need to pre-sort the keys by 𝑠 (𝑥). 4 HOMOGENEOUS RIBBON FILTERS
• sgauss requires explicitly storing a pivot position for each Recall that the idea underlying Ribbon filters is to pick hash func-
row. This is because sgauss does not compute an echelon tions ℎ® : U → {0, 1}𝑚 , 𝑏 : U → {0, 1}𝑟 and find 𝑍 ∈ {0, 1}𝑚×𝑟
form but only ensures that in each row the left-most 1-entry such that all 𝑥 ∈ 𝑆 satisfy ℎ(𝑥)
® · 𝑍 = 𝑏 (𝑥), while most 𝑥 ∈ U \ 𝑆
— the pivot — is the bottom-most 1-entry of its column. will not.
• sgauss performs roughly 𝐷 elimination steps that, depend- We now examine what happens when we get rid of the fin-
ing on some bit, turn out to be xor-operations or no-ops. gerprint function 𝑏 effectively setting 𝑏 (𝑥) = 0 for all 𝑥 ∈ U. A
Ribbon on the other hand performs roughly 𝐷/2 bit shifts filter is then given by a solution 𝑍 to the homogeneous system
and 𝐷/2 (unconditional) xor operations. Though the details ®
(ℎ(𝑥) · 𝑍 = 0𝑟 )𝑥 ∈𝑆 . The FP rate for 𝑍 is 𝑓𝑍 = Pr𝑎∼𝐻 [𝑎 · 𝑍 = 0𝑟 ]
are complicated, intuition on branching complexity seems where 𝐻 is the distribution of ℎ(𝑥)® for 𝑥 ∈ U. An immediate is-
to favour ribbon. sue with the idea is that 𝑍 = 0𝑚×𝑟 is a solution giving 𝑓𝑍 = 1.
A solution 𝑍 chosen uniformly at random from all solutions fares
3.1 Ribbon with Smash better, however. To obtain one, all free variables, i.e. the variables
When aiming for high space efficiency, there is an issue with early corresponding to empty rows of 𝑀, are initialized randomly during
and late columns of the linear system. We shall describe the problem back substitution.14 The overall FP rate is then 𝑓 = E[𝑓𝑍 ] where 𝑍
and its solution in an extreme but simple case where perfect space depends on the randomness in (ℎ(𝑥)) ® 𝑥 ∈𝑆 and the free variables.
efficiency, i.e. 𝑚 = 𝑛 is desired. We call the resulting construction Homogeneous Ribbon filter. It
In the absence of redundant equations, the construction of the has two obvious advantages over Standard Ribbon filters:
linear system succeeds only if all slots of 𝑀 can be filled. For the
• Constructions can never fail, regardless of 𝑛, 𝜀 and 𝑤. This
first 𝑖 slots to be filled, it is necessary that |{𝑥 ∈ 𝑆 | 𝑠 (𝑥) ≤ 𝑖}| ≥ 𝑖.
is simply because a homogeneous linear system always has
Figure 4 (a) illustrates that random fluctuations make this unlikely.
at least the trivial solution.
There is a similar problem relating to the last 𝑖 columns13 .
• The absence of fingerprints slightly improves time and space
in construction. (In optimized implementations, query times
(a) (b)
are essentially the same.)
A complication is that the FP rate 𝑓 can be higher than 2−𝑟 . Intu-
itively, if too many equations constrain some part of 𝑍 then that
part will be insufficiently random. For 𝜀𝑤 > 𝐶𝑟 (for some constant
𝐶 > 0) and large 𝑛, this effect is negligible as we shall argue in
Theorem 4.1. This still leaves us with two disadvantages, especially
for small 𝑛 and high 𝑟 :
• The product 𝜀𝑤 must be proportional to 𝑟 (for acceptable 𝑓 )
whereas for Standard Ribbon 𝜀𝑤 need only be proportional
to log 𝑛 (for acceptable success probability). We should there-
Figure 4: (a) Consider the diagonal (dashed) in a square Rib- fore not expect an improvement over Standard Ribbon for
bon system. Its beginning and end may lie outside of the of 𝑟 = Ω(log 𝑛).
the (shaded) ribbon area. • For small 𝑛, the variance of 𝑓𝑍 is quite high, meaning small fil-
(b) Our “smash” variant solves this problem. ters will occasionally have significantly more false positives
(e.g. due to random skew in (𝑠 (𝑥))𝑥 ∈𝑆 ) with no obvious way
We can address this issue by artificially inflating the probabilities to detect this during construction. This could be dangerous
Pr[𝑠 (𝑥) = 1] and Pr[𝑠 (𝑥) = 𝑚 − 𝑤 + 1] of the first and last starting for some applications.
position by a factor of ℓ we call the smash value. Such a distribution
for 𝑠 is easy to implement using a uniform distribution on [−ℓ + 4.1 Analysis
2, 𝑚 − 𝑤 + ℓ] and “clamping” the sampled value to [1, 𝑚 − 𝑤 + 1]
We can make a strong case for Homogeneous Ribbon filters by
using min and max functions. Micro-benchmarks show roughly
showing that arbitrarily small space overhead at arbitrarily large
3ns overhead per query for smash, on an Intel Skylake CPU.
13 To
see the symmetry, we could have argued about the rank of the first 𝑖 columns of 14 Our implementation uses trivial pseudo-random assignments: a free variable in row
𝑀 which is at most | {𝑥 ∈ 𝑆 | 𝑠 (𝑥) ≤ 𝑖 } | . 𝑖 is assigned 𝑝𝑖 mod 2𝑟 for some fixed large odd number 𝑝 .
5
Peter C. Dillinger and Stefan Walzer
size 𝑛 is achievable. This neither requires the ribbon width 𝑤 to (b1) 𝑖 is contained in a long overloaded interval.
scale with 𝑛, nor a deviation from the pure construction (e.g. by (b2) 𝑖 ∈ pos(𝑆 ′ ) for a minimal dependent set 𝑆 ′ with long non-
partitioning the key set into small shards). overloaded interval pos(𝑆 ′ ).
By space overhead we mean space
opt − 1 where space is the space (b3) 𝑖 ∈ pos(𝑆 ′ ) for a minimal dependent set 𝑆 ′ with short interval
usage of the filter in bits per key and opt = − log2 (𝑓 ) is the pos(𝑆 ′ ).
information-theoretic lower bound for filters that achieve the same We shall now establish the following
FP rate.
Claim: ∀𝑖 ∈ [𝑚] : Pr[𝑖 is bad] = exp(−Ω(𝜀𝑤)).
Theorem 4.1. There exists 𝐶 ∈ R+ such that for any 𝜀 < 1/2,
any desired FP rate 𝜑 > 0 and 𝑟 the closest integer to − log2 (𝜙) the For each 𝑖 ∈ [𝑚] the contributions from each of the badness con-
following holds. For any 𝑤 ∈ N with 𝜀𝑤 > 𝐶 max(log 𝑤, 𝑟 ) and ditions (b1,b2,b3) can be bounded separately. In all cases we use
log 𝑤
𝑛 ∈ N the Homogeneous Ribbon filter with 𝑛 keys and parameters our assumption 𝜀 ≥ 𝐶 𝑤 . It ensures that exp(−Ω(𝜀𝑤)) is at
𝜀, 𝑤, 𝑟 has 𝑓 ∈ [𝜑/2, 2𝜑] and space overhead at most 2𝜀. most exp(−Ω(log 𝑤)) = 𝑤 −Ω (1) and can “absorb” factors of 𝑤
in the sense that by adapting the constant hidden in Ω we have
Our argument starts with the following simple observation. 𝑤 exp(−Ω(𝜀𝑤)) = exp(−Ω(𝜀𝑤)).
Lemma 4.2. In the context of a Homogeneous Ribbon filters let 𝑝 (b1) A Chernoff bound for sums 𝑋 = 𝑗 𝑋 𝑗 of i.i.d. indicator ran-
Í
®
be the probability that for 𝑦 ∈ U \ 𝑆 the vector ℎ(𝑦) is in the span of dom variables with 𝜇 = E[𝑋 ] is
®
(ℎ(𝑥)) . Then we have
𝑥 ∈𝑆 Pr[𝑋 ≥ (1 + 𝛿)𝜇] ≤ exp(−𝛿 2 𝜇/3). (4)
−𝑟
𝑓 = 𝑝 + (1 − 𝑝)2 .
We use it in a case where 𝐼 is an interval and 𝑋 1, . . . , 𝑋𝑛+1
Proof. First assume there exists 𝑆 ′ ⊆ 𝑆 with ℎ(𝑦)
® ®
= 𝑥 ∈𝑆 ′ ℎ(𝑥) indicate which of the keys in 𝑆 + have a starting position within
Í
which happens with probability 𝑝. In that case 𝐼 . For 𝑛 ≫ 𝑤 we have
® ® ® · 𝑍) = 0 (𝑛 + 1)|𝐼 | 𝑛|𝐼 |
∑︁ ∑︁
ℎ(𝑦) ·𝑍 = ( ℎ(𝑥)) ·𝑍 = (ℎ(𝑥) 𝜇 = E[𝑋 ] ≤ ≈ = |𝐼 |/(1 + 𝜀).
𝑥 ∈𝑆 ′ 𝑥 ∈𝑆 ′ 𝑚 −𝑤 +1 𝑚
and 𝑦 is a false positive. Otherwise, i.e. with probability 1 − 𝑝, an Skipping over some uninteresting details, the probability for 𝐼
attempt to add ℎ(𝑦)
® ·𝑍 = 0 to 𝑀 after all equations for 𝑆 were added to be overloaded is (for 𝑛 ≫ 𝑤)
would have resulted in a (non-redundant) insertion in some row Pr[𝑋 ≥ (1 − 𝜀/2)|𝐼 |] ≤ Pr[𝑋 ≥ (1 + 𝜀/6) |𝐼 |/(1 + 𝜀) ]
𝑖. During back substitution, only one choice for the 𝑖-th row of 𝑍 Eq. 4 −𝜀 2 |𝐼 |
|{z} | {z }
satisfies ℎ(𝑦)
® · 𝑍 = 0. Since the 𝑖-th row was initialized randomly ≤ exp( ). 𝛿 ≥𝜇
(5)
we have Pr[ℎ(𝑦) · 𝑍 = 0 | ℎ(𝑦) ∉ span((ℎ(𝑥)) 108(1 − 𝜀)
𝑥 ∈𝑆) ] = 2 .
® ® ® −𝑟 □
The probability for 𝑖 ∈ [𝑚] to be contained in a long over-
We shall now derive an asymptotic bound on 𝑝 in terms of large loaded interval is bounded by the sum of Equation (5) over all
𝑤 and small 𝜀 (recall 𝜀 = 𝑚−𝑛 𝑛 ). It is too imprecise to estimate 𝑝 lengths |𝐼 | ≥ 𝑤 2 and all |𝐼 | offsets that 𝐼 can have relative to 𝑖.
and 𝑓 in practical settings, which we do empirically in Section 4.2. The result is of order exp(−Ω(𝜀 2𝑤 2 ) and hence small enough.
The main takeaway is rather that 𝑝 does not depend on 𝑛. We may (b2) Consider a long interval 𝐼 that is not overloaded, i.e. |𝐼 | ≥ 𝑤 2
therefore expect Homogeneous Ribbon filters to scale to arbitrary and |𝑆𝐼 | ≤ (1 − 𝜀/2)|𝐼 |. There are at most 2 |𝑆𝐼 | sets 𝑆 ′ of keys
sizes 𝑛 with no increase in 𝑓 even when 𝑤 and 𝜀 are constants. with pos(𝑆 ′ ) = 𝐼 and each is a dependent set with probability
Lemma 4.3. There exists a constant 𝐶 such that for any 𝑤 ∈ N 2−|𝐼 | because each of the |𝐼 | positions of 𝐼 that 𝑆 ′ touches
log 𝑤 imposes one parity condition.
and 𝐶 𝑤 ≤ 𝜀 ≤ 12 we have 𝑝 = exp(−Ω(𝜀𝑤)).
The probability for 𝐼 to support at least one dependent set is
The main ingredient in the proof is that exp(−Ω(𝜀𝑤)) bounds therefore at most 2−|𝐼 | · 2 |𝑆𝐼 | = 2− 2 |𝐼 | = exp(−Ω(𝜀 |𝐼 |)).
𝜀
the number of keys that cannot be (non-redundantly) inserted, Similar as in (b1) for 𝑖 ∈ [𝑚] we can sum this probability over
which follows from [22].15 all admissible lengths |𝐼 | ≥ 𝑤 2 and all offsets that 𝑖 can have
Proof Sketch. We may imagine that 𝑆 ⊆ U and 𝑦 ∈ U \ 𝑆 in 𝐼 to bound the probability that 𝑖 is bad due to (b2).
are obtained from a set 𝑆 + ⊆ U of size 𝑛 + 1 by picking 𝑦 ∈ 𝑆 + at (b3) Let 𝑆 red ⊆ 𝑆 be the set of redundant keys, i.e. keys for which
random and setting 𝑆 = 𝑆 + \ {𝑦}. Then 𝑝 is simply the expected Algorithm 1 returns “success (redundant)”. While 𝑆 red de-
fraction of keys in 𝑆 + that are contained in some dependent set, i.e. pends on the insertion order, the rank defect |𝑆 red | = 𝑛 −
in some 𝑆 ′ ⊆ 𝑆 + with 𝑥 ∈𝑆 ′ ℎ(𝑥)
Í ® = 0𝑚 . Clearly, 𝑥 is contained in a rank((ℎ(𝑥))
® 𝑥 ∈𝑆 ) does not. A central step in [22] implies that
dependent set if and only if it is contained in a minimal dependent E[|𝑆 red |] = 𝑚 · exp(−Ω(𝜀𝑤)).16
set. Such a set 𝑆 ′ always touches a consecutive set of positions, i.e. Now if 𝑖 is bad due to (b3) then 𝑖 ∈ pos(𝑆 ′ ) for some minimal
pos(𝑆 ′ ) := 𝑥 ∈𝑆 ′ [𝑠 (𝑥), 𝑠 (𝑥) + 𝑤 − 1] is an interval.
Ð dependent set 𝑆 ′ with short pos(𝑆 ′ ). At least one key from
We call an interval 𝐼 ⊆ [𝑚] long if |𝐼 | ≥ 𝑤 2 and short otherwise. 𝑆 ′ is redundant (even if all keys from pos(𝑆 ′ ) are inserted
We call it overloaded if 𝑆𝐼 := {𝑥 ∈ 𝑆 + | 𝑠 (𝑥) ∈ 𝐼 } has size |𝑆𝐼 | ≥ first). In particular, 𝑖 is within short distance (< 𝑤 2 ) of the
|𝐼 | · (1 − 𝜀/2). Finally, we call a position 𝑖 ∈ [𝑚] bad if one of the starting position of a redundant key 𝑥. Therefore at most |𝑆 red |·
following is the case: 16 The log 𝑛
bound is only used to show that for 𝑤 = Ω ( 𝜀 ) all insertions succeed with
15 Recall Footnote 11. high probability.
6
Ribbon filter: practically smaller than Bloom and Xor
Balanced 𝑤 = 32
Balanced 𝑤 = 64
too large or too small. (Small 𝜀 causes 𝑓 to explode due to densely-
Blocked
Xor, etc.
2−6 a pretty clear pattern in where the space overhead is minimized for
mo Bloom
og. 𝑤
any large 𝑛:
g. 𝑤 =
g. 𝑤 =
g.
2−4 4 + 𝑟 /4
𝑤=
(7)
= 32
𝜀≈ .
1
16
64
2−2 𝑤
28
% space
overhead Note how the recommendation 𝜀 > 𝐶 max(log 𝑤, 𝑟 )/𝑤 one might
100 50 30 16 8 4 2 1
derive from Theorem 4.1 (vaguely) agrees. We use (7) in all ex-
periments. In Figure 5 we show combinations of space overhead
Figure 5: Combinations of space overhead and false positive and FP rate to expect from Homogeneous Ribbon filters for 𝑤 ∈
rate achievable with various ribbon widths 𝑤 and large 𝑛. {16, 32, 64, 128} and large 𝑛.17 Balanced Ribbon filters, also in Fig-
(Experimental data for Ribbon configurations use 𝑛 = 106 ure 5, are discussed in Section 6.
but generalize.) For example, consider using 𝑟 = 7 for roughly 1% FP rate and
𝑤 = 64 for reasonable space-vs.-time trade-off. Using 𝜀 ≈ 0.09 from
Equation (7) we observe 𝑓 ≈ 0.81% > 0.78% ≈ 2−7 , so actual space
2𝑤 2 positions are bad due to (b3), which is an exp(−Ω(𝜀𝑤))-
overhead is closer to 10% than the 9% allocated with 𝜀.
fraction of all positions as desired.
With a concentration argument the following variant of the claim 5 MAKING RIBBON PRACTICAL
can be proved. We omit the details.
5.1 Configurability and elasticity
Claim’: ∀𝑥 ∈ 𝑆 + : Pr[𝑠 (𝑥) is bad] = exp(−Ω(𝜀𝑤)).
A useful and perhaps previously unreplicated feature of Bloom
Now assume that a key 𝑥 ∈ 𝑆 + is contained in a minimal dependent filters is the ability to efficiently utilize any amount of space for
set 𝑆 ′ . It follows that all of pos(𝑆 ′ ) is bad. Indeed, either pos(𝑆 ′ ) is a minimizing the FP rate in representing any number of keys. We
short interval (→ b3) or it is long. If it is long, then it is overloaded call this configurability and suggest it is practically important for
(→ b1) or not overloaded (→ b2). In any case 𝑠 (𝑥) ∈ pos(𝑆 ′ ) is bad. space efficiency. Consider an application with little control over
Therefore, the probability 𝑝 for 𝑦 ∈ 𝑆 + to be contained in a the number of keys going into a filter. Even if we use a perfectly
dependent set is at most the probability for 𝑠 (𝑦) to be bad. This is space efficient filter for a specific FP rate, we could be wasting
upper-bounded by exp(−Ω(𝜀𝑤)) using Claim’. □ significant space due to internal fragmentation from an allocator.
A memory allocator like jemalloc [33] averages about 10% internal
We are now ready to prove Theorem 4.1.
fragmentation on arbitrarily sized allocations, space that should
Proof of Theorem 4.1. We shall find that our filter has 𝑓 ≤ ideally be used by the filter to reduce its FP rate 18 .
2−𝑟 (1 + 𝜀 2 ) and hence 𝑓 ∈ [𝜑/2, 2𝜑] with high probability. More specifically, Bloom alternatives such as Cuckoo, Quotient,
The space consumption of 𝑍 ∈ {0, 1}𝑚×𝑟 is space = 𝑚𝑟 and Xor conventionally use cells of some whole number of bits, as
𝑛 = 𝑟 (1+𝜀)
bits per key. To relate this to opt, we need bounds on 𝑝. Ribbon does with 𝑟 . Whole number 𝑟 limits space-efficient choices
By assumption 𝜀𝑤 > 𝐶 log 𝑤, so a large enough choice for 𝐶 of FP rates and bits per key. For example, using only 5-bit cells
permits the use of Lemma 4.3, which guarantees 𝑝 = exp(−Ω(𝜀𝑤)). when 5.5 bits per cell is available adds roughly 10% space overhead
Again using 𝜀𝑤 > 𝐶 max(log 𝑤, 𝑟 ) for large enough 𝐶 gives to our filter. Other than Ribbon, which is tied to the two-element
field, these same Bloom alternatives can use fractional-bit cell sizes.
1
𝑝 ≤ exp(−2 log(𝑤) − 𝑟 ) ≤ 2 𝑒 −𝑟 ≤ 𝜀 2 2−𝑟 . (6) Some configurations can even be made efficient, such as 64/𝑖 bits per
𝑤
cell for whole 𝑖 (an existing Xor10.666 implementation is tested in
Together with Lemma 4.2 we get Section 7), but fine granularity would surely be more CPU intensive.
opt = − log2 (𝑓 )
Lem 4.2
= − log2 (𝑝 + (1 − 𝑝)2−𝑟 ) An alternative way of generalizing to effectively fractional 𝑟 is to
split available space between two structures: one using ⌈𝑟 ⌉ solution
Eq. 6
≥ − log2 (𝑝 + 2−𝑟 ) ≥ − log2 (2−𝑟 (1 + 𝜀 2 )) columns (or bits per cell) and the other using ⌊𝑟 ⌋ for a weighted
average of 𝑟 , chosen to fit available space. This only slightly in-
= 𝑟 − log2 (1 + 𝜀 2 ) ≥ 𝑟 − 𝜀 2 . creases the overall space overhead. For example, using 𝑟 = 5.5
Putting everything together yields yields (non-homogeneous) FP rate of 3/128. The lower bound for
this rate is 𝜆 = 5.415 bits per key, so the approach adds 1.57% to
space 𝑟 (1 + 𝜀) (1 + 𝜀)
= ≤ ≤ 1 + 2𝜀.
opt 𝑟 − 𝜀2 (1 − 𝜀 2 ) 17 Standard Ribbon filters are not shown because the achievable overhead depends
1 on 𝑛 . We do recommend Standard Ribbon (with smash) for small 𝑛 < 104 where
The last step makes use of 𝜀 < 2. □ Homogeneous Ribbon filters have high variance in FP rate.
18 See RocksDB’s optimize_filters_for_memory option [27].
7
Peter C. Dillinger and Stefan Walzer
space usage, and that choice is much more free when it does not af- success; almost sure construction success with Standard Ribbon
fect instruction-level data alignment, only alignment in CPU caches requires significantly more space overhead than 5% failure chance,
and pages, which we consider a relatively minor concern. Ribbon- a good space-time trade-off in our judgment.
width ICML is good for using ⌊𝑟 ⌋ columns before ⌈𝑟 ⌉ columns Although Standard Ribbon does not scale infinitely for fixed
(Section 5.1), because upon determining the starting memory lo- ribbon width and space overhead, it is more space-efficient than
cation and (smallest) applicable number of columns, which can be standard Xor (23% overhead) for most practical 𝑛, which can be seen
done without conditional branches, the remaining query code does in Table 1 and Section 7. Unlike many other Gaussian structures,
not have to be aware of mixed numbers of columns; see Figure 6. construction speed is not a significant concern for scaling Ribbon
(⌈𝑟 ⌉ before ⌊𝑟 ⌋ is better for pure column major.) to large 𝑛.
A minor disadvantage of ICML is that the number 𝑚 of solution
rows must be a multiple of the number of bits in an ICML word, Scaling with sharding. There are many standard or obvious ways
which can present a conflict between configurability (accommodat- to construct a large, space-efficient filter from many smaller space-
ing any number of keys) and space efficiency for small 𝑛. efficient filters [37, 54]. Two ways of leveraging Ribbon features
are notable, but not evaluated in detail:
5.3 Practical hashing for Ribbon
• If uniformly hash-partitioning keys into fixed-size data struc-
Hash expansion. A filter structure “consuming” some quantity ture shards, the fractional 𝑟 feature of Ribbon (Section 5.1)
of hash information can operate from a smaller hash H (𝑥) of the can be used to accommodate variance in the number of keys
original key 𝑥 [28, 29]. The practical requirements for Ribbon filters mapped to each shard.
are these: • If determining hash ranges to assign to each shard, on-the-fly
• H (𝑥) values must be large enough to have an insignificant banding allows adding entries or buckets (in sharding hash
baseline FP rate due to full hash collisions, i.e. H (𝑥 1 ) = order, independent of start location order) until one fails and
H (𝑥 2 ) for 𝑥 1 ≠ 𝑥 2 . A 64-bit hash for H should suffice for starts the next shard. The “⟨add till failure⟩” rows in Table 1
almost all non-cryptographic applications, as having 232 correspond to this strategy of adding entries until one fails,
keys in a single filter incurs a baseline FP rate of just 2−32 . so yields good average space efficiency without construction
• H (𝑥) is effectively extended / expanded / remixed to what retries, such as < 1% overhead with 𝑚 = 210 per shard, not
is consumed20 . For Ribbon, it is most important to minimize including sharding metadata.
correlations between the starting location and other hash
consumers. A starting location computed with fastrange [45]
Shard sizes. Section 3.1 describes an inherent unlikelihood of fill-
on H (𝑥) relies primarily on upper bits, so multiplying H (𝑥)
ing all slots in a Ribbon system, even if 𝑚 = 𝑤, and how the likeli-
by a large odd constant (as with Knuth multiplicative hash
hood is similar with 𝑚 = Θ(𝑤 2 ). Because the expected number of
[42]) seems to suffice for removing correlation. Details are
empty slots at first failure to add remains constant even for small
in the reference implementation of Ribbon [25].
𝑚, the median proportion of unoccupied slots at failure decreases
Re-seeding. Some Ribbon designs need the ability to retry con- with 𝑚 before increasing with 𝑚, for a fixed ribbon width 𝑤. For
struction with sufficiently independent hashing to have an indepen- common ribbon widths, the minimum appears to be around 𝑤 2 /4,
dent probability of construction success. Observe that for Ribbon which we suggest is the natural shard size, subject to practical
filters a full hash collision does not interfere with construction suc- adjustment for the application.
cess (it only produces a redundant equation). We find in significant
testing that modifying an unseeded stock hash value with simple Soft sharding. With Ribbon we can apply sharding at a higher
XOR with a pseudorandom seed then multiplication by a large odd abstraction layer than memory space, for potentially better space
constant suffices for independent probability of construction suc- efficiency. In a typical hard sharding, construction optimizes for
cess. See [25] for details. Assuming uniform hashes, an effective each shard either (a) a set of keys, (b) a memory size, (c) a hash seed,
alternative to re-seeding on failed construction is simply to increase or (d) some other configuration parameters, based on the others,
𝑚 by a factor of 𝑤+1 and records the optimized configuration in some metadata. The
𝑤 .
change with soft sharding is that each shard is assigned a contigu-
5.4 Standard Ribbon scalability ous range of Ribbon start locations (from a single Ribbon system)
rather than a contiguous memory space (containing an indepen-
We refer to the non-homogeneous Ribbon construction, including
dent Ribbon system). This should be a pure win for space efficiency,
smash when appropriate, as Standard Ribbon. Construction fails
because the overlap of 𝑤 − 1 probing rows between adjacent soft
with some probability depending on 𝑚, 𝑛 and 𝑤, though we have no
shards allows them to, in effect, borrow some space from each other
formula. Table 1 provides some empirical data points for how much
without expending metadata. (A Standard or Homogeneous Rib-
configured space overhead, 𝜀 = (𝑚 − 𝑛)/𝑛, is required for several
bon filter is a naive soft sharding with no metadata guiding the
failure probabilities that represent different construction time vs.
shard assignments.) With some ordering constraints and temporary
solution space trade-offs. Observe that unlike standard Xor filters,
tracking data, the Ribbon algorithm allows us to backtrack, such
Ribbon does not exhibit sharp threshold behavior in construction
as for changing the hash seed within a shard or key assignments
20 In at
least two cases [1, 23], implementations citing an asymptotic result for efficient to shards. We do not analyze the “soft sharding” design space in
hashing in Bloom filters [40] had practical flaws that previous work [28] warned about. detail, but use the idea for Balanced Ribbon.
9
Peter C. Dillinger and Stefan Walzer
6 BALANCED RIBBON
𝛼 𝛼
Balanced Ribbon is an experimental design for scaling and space-
100%
optimizing Standard Ribbon; for implementation details see [26].
100%
We intend Balanced Ribbon as an example in the design space
opened up by the new on-the-fly and incremental Gaussian elimina-
𝛼
tion algorithm, and encourage follow-up work to explore, optimize, Level
1 2 3 ℓ=4
and analyze this design space.
Shard
Balanced Ribbon extends Standard Ribbon with soft sharding 1 2 3 4 5 6 7 𝑠=8
and a new balanced allocation scheme tailored to this domain (re-
lated: [4, 6, 7, 17, 49, 58]). Like many other hashing schemes, we
start with the idea that each entry has two possible locations in Figure 7: Bumping behavior between levels of Balanced
the Ribbon, given by two hashes: an earlier primary location and Ribbon. The expected relative quantity of entries with pri-
a later secondary location in a distinct shard. The shard with the mary locations in each level is indicated by where
primary location is constructed before the shard with the secondary are bumped and added in their secondary location . Not
location and accommodates the key if possible. If not, we say the shown: allocation overheads (𝜀), Poisson variances, bucket
key is “bumped” and must be accommodated in its secondary shard, boundaries, and dispersion within each shard.
so shards add “bumped” entries first for best chance of success.
Metadata is constructed to indicate which keys are bumped21 . The
construction is greedy in that a shard tries to accommodate as many
keys as possible, without considering where keys will be bumped to, perfectly, the average number of entries bumped from each level 1
and constructed shards are never revisited. Difficulty arises if using shard will be 𝛼𝑛/𝑠. Because level 2 shards are half as many, they
two uniform hashes, the smaller primary and the larger secondary: receive 2𝛼𝑛/𝑠 entries on average for adding in secondary location.
later shards are dominated by entries in their secondary location. With those bumped entries, level 2 shards are now overloaded to
Organizing shards for bumping. To make this work we orga- relative weight 1 + 𝛼 compared to 1 − 𝛼 for later shards. With this,
nize the soft shards into levels 1..ℓ, with level 𝑖 containing exactly we have a recursive structure to ensure a continuous supply of
⌈2ℓ−𝑖−1 ⌉ shards, so we assume a power of two number of shards entries eligible for bumping down to the last shard. (Like spatial
overall, 𝑠 = 2ℓ−1 . Unlike some multi-level hashing schemes [12, coupling [57] and “always go left” [56], we are making productive
41, 43], an entry’s primary location can be on any level 𝑖, with its use of less randomness.)
secondary location uniformly on level min(𝑖 + 1, ℓ). Because no The last shard (level ℓ) is different but does not need to be com-
secondary locations are in level 1, we overload it with primary plicated. If we configure our allocated space overhead assuming
locations; level 1 shards have relative weight 1 + 𝛼 for primary loca- Standard Ribbon overheads for the last shard, along with tighter
tions and other level shards have weight 1 − 𝛼. See√︁Figure 7. With overheads for the other shards, it seems to work (single shard Bal-
average 𝑛/𝑠 keys per shard, we choose 𝛼 ≈ 3.5/ 𝑛/𝑠 to ensure anced Ribbon ≡ Standard Ribbon). For large number of shards, we
a sufficient supply of entries even for shards with three Poisson observe the Balanced Ribbon final shards either completely over-
standard deviations below the mean number of entries. (We use whelmed with bumped entries (construction failure) or receiving
𝛼 = 1/8 for 𝑛/𝑠 ≈ 1000.) Assuming we allocate our space overhead almost no bumped entries. We believe this is because the overall
variance in utilization of slots in all prior shards is concentrated
21 Thereis no such disambiguation in [21] and a query would combine information into the last shards, and that variance is large relative to a single
from both locations. This significantly complicates the linear system, however. shard. Because the variance is small overall with a large number of
10
Ribbon filter: practically smaller than Bloom and Xor
shards, construction success is more predictable at scale (apparent Table 2: Experimental performance comparisons
threshold behavior).
Space ns/key, 𝑛 = 106 ns/key, 𝑛 = 108
Buckets for bumping. We use another (semi-)independent hash
Configuration ovr % con query con query
to order or group keys within a shard strictly for bumping. Selecting
a threshold on that hash works well for both Ribbon and metadata ↓ FP rate around 1%, Ribbons using 𝑟 = 7 ↓
space efficiency. First, we have chosen our shard size such that
BlockedBloom[44] 52.0 11 14 ± 0 32 37 ± 0
controlling the number of entries going into a shard is much more
BlockedBloom[24] 49.8 21 10 ± 0 72 36 ± 0
important than the particular set of entries. Second, a Θ(log(𝑛/𝑠))
Cuckoo12† 46.2 69 21 ± 0 147 58 ± 0
bit threshold value is small metadata per shard. However, we do
Cuckoo12 40.3 91 20 ± 0 205 58 ± 0
not want to incur the CPU time for sorting entries into so many
Morton12 40.5 87 55 ± 3 106 125 ± 13
buckets per shard.
Xor, 𝑟 = 7‡ 23.0 195 21 ± 0 264 66 ± 1
Instead of a threshold we use Θ(log(𝑛/𝑠)) buckets of geomet-
Xor, 𝑟 = 8 23.0 148 15 ± 0 211 50 ± 0
rically distributed sizes that can be selected independently, using
Xor+, 𝑟 = 8 14.5 171 35 ± 1 299 104 ± 10
the incremental feature of Ribbon to backtrack on failed buckets.
Homog., 𝑤 = 16 52.0 56 40 ± 6 101 88 ± 6
Using some bit tricks to approximate a geometric distribution with
Homog., 𝑤 = 32 20.6 58 38 ± 6 116 85 ± 5
𝑝 = 2−0.5 seems better than 𝑝 = 2−1 , perhaps due to variance in
Standard, 𝑤 = 64 14;20 71 42 ± 5 130 94 ± 7
actual bucket sizes. We prefer 8 buckets per shard for 8 bits of meta-
Homog., 𝑤 = 64 10.1 83 39 ± 7 160 90 ± 6
data per shard, 0.008 bits per key for common shard size. A subtle
Standard, 𝑤 = 128 6;8 166 58 ± 2 235 140 ± 26
part of maximizing space efficiency with independent buckets and
Homog., 𝑤 = 128 5.1 164 53 ± 3 270 145 ± 25
soft sharding is to attempt adding a larger (in expectation) bucket
Balanced, 𝑤 = 32† 15.3 84 47 ± 6 278 104 ± 5
in shard 𝑖 + 1 before attempting to add a smaller (in expectation)
Balanced, 𝑤 = 32 2.5 162 48 ± 5 372 107 ± 5
bucket in shard 𝑖, if the two shards are in the same level. Successfully
Balanced, 𝑤 = 64 0.7 292 49 ± 5 516 111 ± 5
adding the smaller could overflow enough to make it impossible to
Balanced, 𝑤 = 128 0.3 985 66 ± 1 1335 162 ± 29
add the larger; on average, the better greedy choice is trying to add
the larger bucket first. For CPU efficiency, we skip attempts to add Homog.32, 𝑟 = 7.0 20.6 58 38 ± 6 116 85 ± 5
a bucket that is very likely to fail given the number of successful Homog.32, 𝑟 = 7.7 22.7 61 43 ± 4 120 96 ± 7
additions to the shard. Homog.32, 𝑟 = 8.0 22.1 61 39 ± 5 116 86 ± 9
Overall. Balanced Ribbon construction resembles an external sort ↓ FP rate around 10%, Ribbons using 𝑟 = 3 ↓
between CPU cache and main memory. Queries depend on just Xor, 𝑟 = 3‡ 23.1 194 16 ± 0 264 55 ± 0
one bit of sharding metadata, out of a typical 8 bits per shard. Homog., 𝑤 = 16 34.6 46 23 ± 0 91 66 ± 1
Space usage for 𝑤 = 64 Balanced Ribbon is 1.005𝜆 + 0.008 bits per Homog., 𝑤 = 32 16.1 52 21 ± 0 108 55 ± 1
key, relative to the information-theoretic lower bound 𝜆, as shown Standard, 𝑤 = 64 14;20 66 22 ± 0 124 62 ± 0
in Figure 5. We have tested Balanced Ribbon with these space Homog., 𝑤 = 64 8.0 71 21 ± 0 152 72 ± 4
efficiencies up to 4 billion keys; arbitrary scaling might require Standard, 𝑤 = 128 6;8 145 35 ± 0 213 79 ± 1
natural increases in 𝑤 (𝑤 ∼ log 𝑛 and thus 𝑛/𝑠 ∼ log2 𝑛) or might Homog., 𝑤 = 128 4.2 156 33 ± 0 266 79 ± 1
benefit from design changes. Balanced, 𝑤 = 32 2.8 155 29 ± 0 377 68 ± 1
Balanced, 𝑤 = 64 0.8 288 29 ± 0 517 84 ± 11
7 EXPERIMENTS ↓ FP rate around 2−11 , Ribbons using 𝑟 = 11 ↓
Setup. For validation we extend the experimental setup used for Cuckoo16† 36.0 68 20 ± 0 147 59 ± 0
Cuckoo and Xor filters [35, 38, 39], with our code available in our Cuckoo16 30.2 84 20 ± 0 201 59 ± 1
fork on GitHub [26]. Timings are performed on a single-socket CuckooSemiSort 26.6 146 50 ± 0 323 143 ± 4
Intel® Xeon® D-2191 (Skylake DE) with 64GB RAM. Tests are com- Xor, 𝑟 = 12 23.1 175 23 ± 0 251 74 ± 4
piled with GCC 8.4.1, using g++ -O3 -DNDEBUG -march=skylake- Xor, 𝑟 = 10.666‡ 23.0 191 20 ± 0 284 61 ± 0
avx512. The Ribbon code is portable C++ using no processor intrin- Xor+, 𝑟 = 11‡ 12.8 185 38 ± 0 317 107 ± 2
sics but using compiler built-ins for prefetch, count leading/trailing Homog., 𝑤 = 32 28.0 74 51 ± 2 129 118 ± 32
zero bits, and bit parity. Standard, 𝑤 = 64 14;20 78 54 ± 3 139 138 ± 41
Homog., 𝑤 = 64 12.7 89 51 ± 3 163 118 ± 26
Results. First, Figure 1 shows which approach is fastest for various Standard, 𝑤 = 128 6;9 173 71 ± 9 247 164 ± 39
space overheads and various FP rates, and query and construction Homog., 𝑤 = 128 7.3 189 72 ± 9 292 166 ± 43
times for the corresponding overall fastest approach. It uses a mix Balanced, 𝑤 = 32 2.3 169 60 ± 2 394 142 ± 32
of 𝑛 ∈ {106, 107, 108 }; more details in Section 1. To maximize the Balanced, 𝑤 = 64 0.5 301 61 ± 3 536 147 ± 28
coverage of “BlockedBloom” in the figure, we include an AVX2
† Larger space allocated to improve construction time.
SIMD-optimized implementation from RocksDB [24] that sacrifices
‡ Potentially unfavorable bit alignment.
SIMD-optimized construction for enhanced configurability (any
; Standard Ribbon space overhead depends on 𝑛.
number of probes) and minimized FP rate compared to [44]; all
11
Peter C. Dillinger and Stefan Walzer
“BlockedBloom” in this paper use aligned 512-bit blocks. To maxi- as small), only constant size metadata is read to determine which
mize coverage of “Bloom” we include an obvious variant splitting memory to fetch to complete a query. Although the test relies on
probes between two independent blocks. out-of-order execution between queries for memory prefetching,
Second, Table 2 shows detailed timings for some specific config- applications can use explicit prefetching.22
urations with approximate 𝑓 ∈ {2−3, 2−7, 2−11 }. Construction and To minimize sampling noise, the test machine was otherwise
query times are given in nanoseconds per key over at least 5 · 107 idle. This does not match a production environment, but we have
keys, with a range of query times for different ratios of positive not seen a significant difference in relative results when running
vs. negative queries. The timing data for 𝑛 = 106 represents the the tests under load.
core CPU time of each approach with negligible memory access To match existing test code, hash seeds are not configurable.
overheads, while 𝑛 = 108 includes memory access overheads. The All structures are configured for high chance of construction suc-
𝑟 (1+𝜀)
reported space overhead is log (𝑓 −1 ) − 1 where the FP rate 𝑓 is cess (roughly 99% or more). In practice, approaches other than
2
measured by sampling. Standard Ribbon has distinct overheads for BlockedBloom and Homogeneous would incur some small addi-
the different 𝑛. Cuckoo12 and Morton12 use 12 bits per cell [11, 35], tional overheads associated with seeding and retries.
for similar FP rate as 𝑟 = 9 Xor or Ribbon.
Third, Table 1 shows many more space overheads for Standard 8 CONCLUSION
Ribbon, which mostly improve for smaller 𝑛. While we do not Our result changes the narrative around data structures constructed
include timings for these smaller 𝑛, we can infer that Standard with Gaussian elimination vs. with peeling. With our new algorithm
Ribbon, with or without smash, takes territory from other Ribbon and the query structure from [22], Gaussian elimination can be
approaches in Figure 1 (a) for smaller 𝑛. faster than peeling while also opening up better space efficiency.
On this foundation we have built Ribbon filters with the follow-
Observations. When saving space compared to Bloom and Cuckoo
ing efficiently scaling variants.
(incl. Morton), we see in Table 2,
• The Homogeneous Ribbon filter is simple and has a construc-
• Ribbon can achieve much lower space overheads than the tion algorithm that never fails. We provide a full analysis.
practical alternatives. 𝑤 = 64 Balanced Ribbon has less than • The Balanced Ribbon filter leverages the on-the-fly and in-
1% overhead except when 𝑟 < 3 (bottom right of Figure 1). cremental construction algorithm in an experimental load-
• Ribbon mostly wins in construction time, sometimes by a balancing scheme that further reduces space overhead.
large factor. Ribbon construction times are generally only
higher than Xor when saving space compared to Xor. As with Bloom filters, the true practicality of Ribbon also comes
• Xor mostly wins in query time. For smaller 𝑟𝑤, which is from being able to configure it for or adapt it to the application,
proportional to the contiguous memory loaded per query, including dynamic conditions. This includes using any amount of
Ribbon query times are similar to Xor. Ribbon query times space to efficiently represent any number of keys, even by shrinking
increase with 𝑟 and/or 𝑤. the structure after construction. Ribbon filters replicate the smooth
• Ribbon has a clear advantage for configurability (aside from FP-rate-for-space configurability of Bloom filters and extend that
Bloom). Xor and Cuckoo incur a measurable penalty when with configurability between space efficiency and time efficiency.
cells are not aligned to a friendly size, such as 4, 8, 12, or
16 bits, and have limited options for partial-bit granularity. Future work could pursue the following goals.
Ribbon performance is continuous for any integer 𝑟 , and the • Deepen the theoretical understanding of Ribbon filters to
penalty for arbitrary fractional average 𝑟 is small . better guide parameter choices in practice.
• Ribbon can easily trade space efficiency for construction • Manage the variance of FP rate of Homogeneous Ribbon
time efficiency (like Cuckoo), by increasing allocated space filters at small scales.
overhead; see † vs. corresponding non-† configurations. Xor • Further explore the design space surrounding Balanced Rib-
construction time is more fixed. bon for approaches that are better and/or easier to analyze.
• Create a SIMD-optimized implementation of Ribbon queries,
A somewhat surprising result is that 𝑤 = 32 Balanced Ribbon
perhaps using AVX512 POPCNT and/or 8-bit ICML rather
is sometimes faster and more space efficient than other Ribbon
than 𝑤-bit ICML.
variants with 𝑤 = 128. We can reason that the benefits of smaller
• Explore how competitive Ribbon can be as a data structure
ribbon width sometimes exceed the costs associated with balancing.
for static functions, including cases where the number 𝑟 of
We consider 𝑤 = 128 Balanced Ribbon “impractical” because the
solution columns is quite high, e.g. 𝑟 ≥ 32.
execution time is much higher for a tiny benefit in space usage.
Limitations. A notable limitation of this test is that it tests query ACKNOWLEDGMENTS
throughput more than query latency. Like executing a batch of filter We thank Martin Dietzfelbinger for early contributions to this line
queries, the test does not depend on the result of each query for of research, and Peter Sanders for later discussions. We thank others
what to do next. For applications executing a single filter query with at Facebook for their supporting roles, including Jay Zhuang, Siying
unpredictable outcome, main memory latency can be 200-300ns. Dong, Shrikanth Shankar, Affan Dar, and Ramkumar Vadivelu.
However, we expect latency to be similar between the competing
approaches, because aside from Balanced Ribbon sharding meta- 22 RocksDB MultiGet uses batched filter queries with explicit prefetching; RocksDB
data (very small, cachable) and Xor+ compression metadata (not Get uses single filter queries.
12
Ribbon filter: practically smaller than Bloom and Xor
[48] Yoshinori Matsunobu, Siying Dong, and Herman Lee. 2020. MyRocks: LSM-Tree [54] Felix Putze, Peter Sanders, and Johannes Singler. 2009. Cache-, hash-, and space-
Database Storage Engine Serving Facebook’s Social Graph. Proc. VLDB Endow. efficient Bloom filters. ACM Journal of Experimental Algorithmics 14 (2009), 18.
13, 12 (Aug. 2020), 3217–3230. [Link] [Link]
[49] Michael Mitzenmacher. 2001. The Power of Two Choices in Randomized Load [55] Kapil Vaidya, Eric Knorr, Tim Kraska, and Michael Mitzenmacher. 2020. Parti-
Balancing. IEEE Trans. Parallel Distributed Syst. 12, 10 (2001), 1094–1104. https: tioned Learned Bloom Filter. CoRR abs/2006.03176 (2020), 13. arXiv:2006.03176
//[Link]/10.1109/71.963420 [Link]
[50] Patrick E. O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth J. O’Neil. 1996. [56] Berthold Vöcking. 2003. How Asymmetry Helps Load Balancing. J. ACM 50, 4
The Log-Structured Merge-Tree (LSM-Tree). Acta Inf. 33, 4 (1996), 351–385. (2003), 568–589. [Link]
[Link] [57] Stefan Walzer. 2021. Peeling Close to the Orientability Threshold: Spatial Cou-
[51] Anna Pagh, Rasmus Pagh, and S. Srinivasa Rao. 2005. An Optimal Bloom Filter pling in Hashing-Based Data Structures. In Proc. 32nd SODA.
Replacement. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium [58] Udi Wieder. 2017. Hashing, Load Balancing and Multiple Choice. Foundations
on Discrete Algorithms (Vancouver, British Columbia) (SODA ’05). Society for and Trends® in Theoretical Computer Science 12, 3–4 (2017), 275–379. https:
Industrial and Applied Mathematics, USA, 823–829. //[Link]/10.1561/0400000070
[52] Boris Pittel and Gregory B. Sorkin. 2016. The Satisfiability Threshold for 𝑘 - [59] Huanchen Zhang, Hyeontaek Lim, Viktor Leis, David G. Andersen, Michael
XORSAT. Combinatorics, Probability & Computing 25, 2 (2016), 236–268. https: Kaminsky, Kimberly Keeton, and Andrew Pavlo. 2018. SuRF: Practical Range
//[Link]/10.1017/S0963548315000097 Query Filtering with Fast Succinct Tries. In Proceedings of the 2018 International
[53] Ely Porat. 2009. An Optimal Bloom Filter Replacement Based on Matrix Solving. Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA,
In Proc. 4th CSR. 263–273. [Link] June 10-15, 2018, Gautam Das, Christopher M. Jermaine, and Philip A. Bernstein
(Eds.). ACM, 323–336. [Link]
14
Configurability in Ribbon filters refers to their ability to efficiently utilize any amount of space to minimize false positive rates (FP rates) irrespective of the size of the key set. This trait makes Ribbon filters exceptionally adaptable in dynamic environments where the number of keys is unpredictable . Unlike conventional Bloom alternatives that rely on fixed number of whole bits per cell, Ribbon filters can adjust space allocations continuously, thus offering fine-tuned configurability. This includes the flexibility of dropping portions of a filter for a higher FP rate or splitting from a completed filter into smaller structures to accommodate changing requirements . Such elastic properties make Ribbon filters highly applicable in environments with fluctuating demands .
Ribbon filters offer enhanced space efficiency compared to traditional Bloom alternatives due to their smooth FP-rate-for-space configurability and elasticity in space usage. Unlike Bloom filters that use fixed bit sizes per cell, Ribbon filters can handle fractional-bit cell sizes by splitting available space between two structures or within a single structure . This flexibility allows better space efficiency by utilizing any amount of space effectively. Moreover, the configurability extends to trade-offs between space efficiency and time efficiency, making Ribbon filters adaptable to various applications . However, smaller Ribbon widths like w=32 occasionally outperform larger ones like w=128, as seen in specific scenarios with Balanced Ribbon filters, indicating space efficiency can sometimes be better with narrower configurations .
Using fractional-bit cell sizes in filters like Ribbon and Xor allows for a finer allocation of space that closely matches the requirements for minimizing FP rates, thus reducing overall space overhead. This is typically achieved by creating configurations that employ a weighted average of bit allocations per cell or by splitting available space into dual structures, as seen with 64/i bits per cell configurations in Xor filters . For example, adopting a fractional configuration increases space overhead only slightly, around 1% for common setups, while maintaining a low FP rate . The ability to adjust cell sizes fractionally enables a continuum of FP-rate vs. space trade-offs, making these filters highly adaptable to specific spatial constraints .
Ribbon filters can be partitioned to enhance configurability by splitting the finished filter into several smaller structures, each with independent FP rates, while maintaining the product close to the initial FP rate . This partitioning can be done by duplicating minimal necessary bits at each boundary, allowing each partition to function independently or adapt to different requirements . This approach offers the advantage of reducing the filter's footprint to fit exact space constraints while maintaining performance. Partitioning also allows for reallocating space based on real-time demand, optimizing the use of available resources . Such flexibility is particularly beneficial in systems that need to dynamically adjust to varying loads or storage requirements, providing elasticity in application operations .
The trade-off between space efficiency and construction time efficiency in Ribbon filters is crucial for their practical application. Ribbon filters are designed to allow a trade-off similar to what Cuckoo filters offer; they can trade more allocated space for faster construction times . For instance, while a w=32 configuration might offer a balance between speed and space usage, a wider ribbon like w=128 can become impractical due to significantly higher execution time for only marginal space benefits . This trade-off allows Ribbon filters to be configured dynamically based on the specific constraints of space availability and required speed, providing flexibility for applications with varying performance requirements .
Balanced Ribbon filters achieve space overhead reduction through an on-the-fly incremental construction algorithm that balances load across the filter, thereby minimizing unused space during the allocation process . This technique helps distribute the construction effort evenly, much like a balancing scheme, to prevent concentrated areas of high load, which can otherwise increase space overhead. One limitation of using Balanced Ribbon filters is that their construction time can be significantly higher compared to standard Ribbon filters, especially when larger widths are used, such as w=128, which become impractical due to lengthy execution times for minimal space benefits . Furthermore, these filters may also introduce complexities in maintaining configurability under varying conditions, limiting their broader application .
Selecting the optimal epsilon (𝜀) for minimizing space overhead in Homogeneous Ribbon filters involves balancing between too large or too small values, as each extreme can lead to inefficiencies . A small 𝜀 can cause the false positive (FP) rate to escalate due to densely packed constraints, while a too large 𝜀 can waste space due to underutilization. In practice, simulations suggest that 𝜀 should be set around 4 + r/4w to achieve a reasonable space-overhead balance for given parameters like r and w . By optimizing 𝜀, the filter's performance can be fine-tuned to maintain desired FP rates while minimizing space usage, crucial for scalable applications .
The construction algorithm of the Homogeneous Ribbon filter is characterized by its simplicity and guarantee of never failing, a distinction from other filter types which might require retries or iterative checking . This reliability is achieved by employing an algorithmic framework that closely manages the filter's balance in real-time during construction, ensuring constraints are met without exceeding space limits. The design not only ensures completion without issues but also simplifies the process compared to other more complex filters that tackle Gaussian eliminations or iterative constraint management . This contributes to a high level of reliability, making the filter suitable for applications where consistent performance without errors is critical .
Gaussian elimination and peeling methods differ primarily in their approach to resolving dependencies within data structures. Gaussian elimination involves a mathematical approach to solving systems of linear equations, allowing for precise and often seamless construction of complex filter configurations . In contrast, peeling employs a more heuristic method, often iteratively removing layers until constraints are satisfied. The benefit of using Gaussian elimination for Ribbon filters is its potential for greater space efficiency and speed, a significant advantage over peeling methods. The precision of Gaussian elimination minimizes retries and unnecessary allocations, achieving optimal space utilization and faster execution times, which are particularly beneficial for applications requiring high throughput and minimal latency .
Future research directions identified for enhancing the efficiency and understanding of Ribbon filters include deepening the theoretical understanding to better guide parameter choices in practice, managing the variance of false positive rates of Homogeneous Ribbon filters at small scales, and further exploring the design space of Balanced Ribbon filters to improve analytical ease and performance . Additional avenues involve creating SIMD-optimized implementations of Ribbon queries and exploring Ribbon's viability as a data structure for static functions, where the number of solution columns is high, such as r ≥ 32 . These developments aim to refine the configurability, performance, and scalability of Ribbon filters for broader application scenarios.