GitHunt
JH

jhorstmann/vbmi-bitpack-decoder

Blazing fast bitpack decoding using avx512-vbmi instructions

Blazing fast bitpack decoding using avx512-vbmi instructions

The avx512-vbmi instruction set introduces two instructions that are very useful for
decoding bitpacked data.

  • vpermb (__mm512_permutexvar_epi8): Shuffles 8-bit integers across lanes.
  • vpmultishiftqb (_mm512_multishift_epi64_epi8): Creates 8 8-bit integers
    per 64-bit lane by shifting that 64-bit word by an arbitrary amount.

Combining these two instructions with masked loads allows to build a fully table-driven bitpack
decoder for arbitrary bit widths up to 30. In Parquet, bit width encoding works on blocks of 8
values, which ensures that each block ends on a byte boundary.

Since an avx512 register can hold 32 16-bit values, we can process 4 blocks into 16-bit values
per step. For 8-bit values we could process 8 blocks at once, for 32-bit values it would be 2
blocks per step.

Output Type Blocks per step Input bytes per step Output words per step
8-bit 8 bitwidth * 8 64
16-bit 4 bitwidth * 4 32
32-bit 2 bitwidth * 2 16

The general algorithm works as follows:

  • We load the bytes needed for each decode step. This depends on the bit width and the number
    of blocks per step explained above. The load mask needed can be calculated once up front.
  • We shuffle these bytes across the whole 512-bit register using vpermb, so that for each
    output byte the required bits are inside the same 64-bit lane.
  • This allow vpmultishiftqb to reassemble the output bytes.
  • We need to mask out high bits of each output word using vpandq.
  • We store the resulting 512-bit register to memory.

This means that excluding loads/stores, there are only 3 operations needed per 32 16-bit words,
64 8-bit values or 16 32-bit values. All these operations are independent of the bit width, the
load mask, permutation indices and shift amounts can be calculated or looked up in a table but
the operations themselves do not change. This is a huge difference to many other bitpack decoders,
which require generating specific code sequences per bit width.

The algorithm also does not require padding the input to larger power-of-2 sized chunks. For
arbitrary input sizes there needs to be additional code to handle the tail that is not a multiple
of the input bytes per step. This does not change the algorithm steps, except that we need to
calculate a separate load and store mask that takes the amount of remaining input bytes and
output words into account.

Example

As an example, consider a block of values encoded using a bit width of 3. The bits corresponding
to an output word are marked by letters corresponding to that word from A-H. Bytes are separated
by space and bits per bytes are written starting from the most significant one.

CCBBBAAA FEEEDDDC HHHGGGFF

Since we use a masked load, the remaining bytes in the register will be set to all zeroes.
Since we are only decoding 8 values, the following description will only look at the lower
128 bits of the avx512 register.

After shuffling these bytes are distributed in the block like this:

CCBBBAAA FEEEDDDC HHHGGGFF 00000000 00000000 00000000 00000000 00000000
FEEEDDDC HHHGGGFF 00000000 00000000 00000000 00000000 00000000 00000000

To reassemble into 16-bit words we now need to, for each byte, select sequence of 8 bits
starting from a certain position inside its 64-bit lane. The shift amounts for this example
then are:

0 8 3 11 6 14 9 17
4 12 7 15 10 18 13 21

Giving the resulting bits:

CCBBBAAA FEEEDDDC DDCCCBBB GFFFEEED EEDDDCCC HGGGFFFE FFEEEDDD 0HHHGGGF
GGFFFEEE 0000HHHG HHGGGFFF 0000000H 00HHHGGG 00000000 00000HHH 00000000

Masking each 16-bit word to only retain 3 bits then results in:

00000AAA 00000000 00000BBB 00000000 00000CCC 00000000 00000DDD 00000000
00000EEE 00000000 00000FFF 00000000 00000GGG 00000000 00000HHH 00000000

Performance

On an Intel Tigerlake maching (i9-11900KB), this algorithm can decode batches of 4096 16-bit values
At a throughput of around 52 billion elements per second, and so is mostly limited by memory
bandwidth. An AMD Zen5 (Ryzen 9 HX PRO 370) reaches a throughput of 70 billion elements per second.

Languages

Rust100.0%

Contributors

Created February 26, 2026
Updated March 3, 2026