Base64 encoding — implementation study

Author: Wojciech Muła
Added on:2015-12-27
Updates:2016-01-17 (results from Core i7, BMI2, 64-bit variant of version B (case #7))

Contents

Introduction

In a basic step of the Base64 encoding three bytes are processed producing four output bytes. The step consist following stages:

  1. load 3 bytes;
  2. split these 3 bytes to four 6-bit indices;
  3. translate the indices using a lookup table;
  4. save four bytes.

In this text different implementations of the procedure are examined.

Loading data and extracting indices (points 1 & 2)

Base (version A)

The naive way is to load 3 bytes and then combine them to 4 indices:

const uint8_t x0 = input[0];
const uint8_t x1 = input[1];
const uint8_t x2 = input[2];

const uint8_t a = x0 & 0x3f;
const uint8_t b = ((x1 & 0xf) << 2) | (x0 >> 6);
const uint8_t c = ((x2 & 0x3) << 4) | (x1 >> 4);
const uint8_t d = x2 >> 2;

This solution requires following operations:

  • 3 memory loads;
  • 3 bit-ands;
  • 2 bit-ors
  • 5 bit-shifts.

Although some of the instructions could be executed in parallel (this depends on a CPU) the overall speed of the code is not very big.

Improvement (version B)

Possible modification is to load at once 4 bytes (we use just 3 of them) and then easily extract 6-bit indices:

const uint32_t x = input[0..3];

const uint8_t a = x & 0x3f;
const uint8_t b = (x >> 8) & 0x3f;
const uint8_t c = (x >> 16) & 0x3f;
const uint8_t d = (x >> 24) & 0x3f;

The modified version requires:

  • 1 memory load;
  • 4 bit-ands;
  • 3 bit-shifts;

A drawback is that 4-bytes chunk is loaded, it's possible to read off the available process space; this is the drawback of all SWAR/SIMD algorithms.

64-bit code

On 64-bit machines an input word could be 64-bit. Thanks to that the number of load decreases two times. However, only 6 bytes is used per iteration.

BMI2

The common scheme in this optimization is right shift followed by masking 6 lowest bits. Bit Manipulation Instruction comes with an instruction bext doing exactly the described operation.

Translating and saving the result (points 3 & 4)

Base (version X)

Again, the basic solution is sequentially translate and save separate bytes:

*out++ = lookup8[a];
*out++ = lookup8[b];
*out++ = lookup8[c];
*out++ = lookup8[d];

Memory loads are needed to properly translate the indices, thus 4 memory loads are unavoidable.

The only positive aspect is that size of lookup8 is only 64 bytes, it perfectly fits in the cache line.

This solution requires:

  • 4 memory loads (4 lookups);
  • 4 memory stores.

Improvement (version Y)

Modification of this step is similar to loading, i.e. we build a 32-bit result and store the dword using just one instruction:

const uint32_t w =  uint32_t(lookup8[a])
                 | (uint32_t(lookup8[b]) << 8)
                 | (uint32_t(lookup8[c]) << 16)
                 | (uint32_t(lookup8[d]) << 24);

output[0..3] = w;

This modification requires:

  • 4 memory loads (4 lookups);
  • 3 bit-ors;
  • 3 bit-shifts;
  • 1 memory store.

Improvement (version Z)

Bit shifts introduced in the above version could be avoided if special lookups were introduced. For each index separate, 32-bit lookup is required, and then the code simplifies to:

const uint32_t w = uint32_t(lookup32_0[a])
                 | uint32_t(lookup32_1[b])
                 | uint32_t(lookup32_2[c])
                 | uint32_t(lookup32_3[d]);

outout[0..3] = w;

The obvious negative point of this solution is much larger lookups (1 kB) than the lookup used in the base version (64 bytes).

Comparison

Sample code is available at github.

The test program converts 32 MiB of data; this is repeated 10 times and the minimum time is printed.

Results from Core i5

# extract indices from ... output tims [s] speedup  
1 ... bytes (A) save bytes (X) 0.121 100.0% ██████████
2 construct dword from 1 lookup table (Y) 0.146 82.9% ████████
3 construct dword from 4 lookup tables (Z) 0.124 97.6% █████████
4 ... 32-bit word (B) save bytes (X) 0.095 127.4% █████████████
5 construct dword from 1 lookup table (Y) 0.122 99.2% ██████████
6 construct dword from 4 lookup tables (Z) 0.099 122.2% ████████████
7 ... 64-bit word (B) save bytes (X) 0.070 172.8% █████████████████

Summary:

  • The only factor causing significant speedup is minimizing memory reads, i.e. using variant B for input. Case 7 is the brightest example.
  • Fancy methods building a result at once do not help at all.
  • The biggest speedup is merely 25% and I guess the memory throughput limits further improvement.

Results from Core i6

# extract indices from ... output tims [s] speedup  
1 ... bytes (A) save bytes (X) 0.082 100.0% ██████████
2 construct dword from 1 lookup table (Y) 0.082 100.0% ██████████
3 construct dword from 4 lookup tables (Z) 0.073 122.3% ████████████
4 ... 32-bit word (B) save bytes (X) 0.056 146.4% ██████████████
5 construct dword from 1 lookup table (Y) 0.064 128.1% █████████████
6 construct dword from 4 lookup tables (Z) 0.050 164.0% ████████████████
7 ... 64-bit word (B) save bytes (X) 0.043 190.7% ███████████████████
8 ... 64-bit word & bextr (B) save bytes (X) 0.060 136.7% █████████████
9 ... 64-bit word & bextr (B) construct 64-bit words from 1 lookup 0.072 113.9% ███████████

Summary:

  • Unlike i5, on Core i7 there is speedup caused by minimizing number of 32-bit writes (clearly visible in cases 3 and 6).
  • However coalescing writes to 64-bit word (case 10) has negative impact.
  • The biggest boost is due to use 64-bit loads (case 7).
  • Using the new shiny instruction bextr doesn't help. I'm underwhelmed and shocked. According to the manuals, throughput and latency are very small, the instruction replaces two other (including quite expansive shift). What a shame.

See also