Consistently faster and smaller compressed bitmaps with Roaring (2016)

o11c 8 months ago

Title is confusing: this is not about the original "Roaring", but an extension of it called "Roaring+Run".

Here, "bitmap" = "set of sometimes-compact integers". The "uncompressed" and several "rle" implementations are obvious. Hm, except this only seems to be talking about a particularly-naive RLE approach (suitable for storage but not computation)? If you're doing computation I expect you to use absolute offsets rather than relative ones which means you can just do binary search (the only downside of this is that you can't use variable-length integers now).

Roaring is just a fixed 2-level trie, where the outer node is always an array of pointers and where the inner nodes can be either uncompressed bitvectors (if dense) or an array of low-half integers (if sparse). Also, it only works for 32-bit integers at a fundamental level; significant changes are needed for 64-bit integers.

This paper adds a third representation for the inner node, the bsearch'able absolute RLE method I mentioned earlier (before even reading the paper beyond the abstract).

Overall there's neither anything novel nor anything particularly exciting about this paper, but if you ignore all the self-congratulations it might work as a decent intro to the subject? Except maybe not since there are a lot of things it fails to mention (the ping-pong problem, deduplicated tries, the approach of using a separate set for sparse values in the same range, the "overall sparse but locally semi-dense" representation that uses fixed-size single-word bitsets, ...)

Drup 8 months ago

You seem well versed into that corner. Do you have a good (and reasonably complete) introduction/exploration for these memory-efficient data-structure for computation ?
I've been working on memory representation of algebraic data types quite a bit, and I've always wondered if we could combine them with succinct data-structures.
- pram 8 months ago
  
  Theres actually a whole website about it! I found it useful when I was doing deeper research into ElasticSearch: https://roaringbitmap.org

softwaredoug 8 months ago

Roaring bitmaps are really useful for doing phrase search in search engines.

Basically you can find cases where one term is immediately before another by left shifting the right terms roaring encoded positions in all documents and bitwise-anding the similarly roaring-encoded payload of the preceding term. All with a highly compressed representation of term positions.

With something like numpy you can do this in a handful of logical operations in python.

https://softwaredoug.com/blog/2024/01/21/search-array-phrase...

pncnmnp 8 months ago

Hey everyone, if you're looking for a more approachable guide on bitmap compression, I wrote a blog post on it this year: https://pncnmnp.github.io/blogs/roaring-bitmaps.html. It covers run-length encoding, BBC, WAH, Concise, and even Roaring Bitmaps.

skybrian 8 months ago

That’s a good read, thanks! The history is interesting, but in practice, I’m wondering if there’s any reason not to skip it and just learn about roaring bitmaps?
- pncnmnp 8 months ago
  
  Absolutely! For inspiration on how roaring bitmaps can be used in practice, check out https://roaringbitmap.org/. The blog post is part of a book I am writing on obscure data structures (https://pncnmnp.github.io/blogs/dsa-book.html), so explaining the history was a way to dive into the evolution of this topic and the limitations of each implementation, in order to motivate a SOTA.

rurban 8 months ago

I found them consistently slower in unary unicode tables, with sizes from 1.000 to 30.000. Binary search and hashtables were both better.

The reason might be that ordered search tables and perfect hashtables can be dumped to C code statically, roaring not, so there is the dynamic alloc and deserialization overhead.