Base64 encoding — implementation study

Author:	Wojciech Muła
Added on:	2015-12-27
Updates:	2016-01-17 (results from Core i7, BMI2, 64-bit variant of version B (case #7))

Contents

Introduction
Loading data and extracting indices (points 1 & 2)
- Base (version A)
- Improvement (version B)
  - 64-bit code
  - BMI2
Translating and saving the result (points 3 & 4)
Comparison
- Results from Core i5
- Results from Core i6
See also

Introduction

In a basic step of the Base64 encoding three bytes are processed producing four output bytes. The step consist following stages:

load 3 bytes;
split these 3 bytes to four 6-bit indices;
translate the indices using a lookup table;
save four bytes.

In this text different implementations of the procedure are examined.

Loading data and extracting indices (points 1 & 2)

Base (version A)

The naive way is to load 3 bytes and then combine them to 4 indices:

const uint8_t x0 = input[0];
const uint8_t x1 = input[1];
const uint8_t x2 = input[2];

const uint8_t a = x0 & 0x3f;
const uint8_t b = ((x1 & 0xf) << 2) | (x0 >> 6);
const uint8_t c = ((x2 & 0x3) << 4) | (x1 >> 4);
const uint8_t d = x2 >> 2;

This solution requires following operations:

3 memory loads;
3 bit-ands;
2 bit-ors
5 bit-shifts.

Although some of the instructions could be executed in parallel (this depends on a CPU) the overall speed of the code is not very big.

Improvement (version B)

Possible modification is to load at once 4 bytes (we use just 3 of them) and then easily extract 6-bit indices:

const uint32_t x = input[0..3];

const uint8_t a = x & 0x3f;
const uint8_t b = (x >> 8) & 0x3f;
const uint8_t c = (x >> 16) & 0x3f;
const uint8_t d = (x >> 24) & 0x3f;

The modified version requires:

1 memory load;
4 bit-ands;
3 bit-shifts;

A drawback is that 4-bytes chunk is loaded, it's possible to read off the available process space; this is the drawback of all SWAR/SIMD algorithms.

64-bit code

On 64-bit machines an input word could be 64-bit. Thanks to that the number of load decreases two times. However, only 6 bytes is used per iteration.

BMI2

The common scheme in this optimization is right shift followed by masking 6 lowest bits. Bit Manipulation Instruction comes with an instruction bext doing exactly the described operation.

Translating and saving the result (points 3 & 4)

Base (version X)

Again, the basic solution is sequentially translate and save separate bytes:

*out++ = lookup8[a];
*out++ = lookup8[b];
*out++ = lookup8[c];
*out++ = lookup8[d];

Memory loads are needed to properly translate the indices, thus 4 memory loads are unavoidable.

The only positive aspect is that size of lookup8 is only 64 bytes, it perfectly fits in the cache line.

This solution requires:

4 memory loads (4 lookups);
4 memory stores.

Improvement (version Y)

Modification of this step is similar to loading, i.e. we build a 32-bit result and store the dword using just one instruction:

const uint32_t w =  uint32_t(lookup8[a])
                 | (uint32_t(lookup8[b]) << 8)
                 | (uint32_t(lookup8[c]) << 16)
                 | (uint32_t(lookup8[d]) << 24);

output[0..3] = w;

This modification requires:

4 memory loads (4 lookups);
3 bit-ors;
3 bit-shifts;
1 memory store.

Improvement (version Z)

Bit shifts introduced in the above version could be avoided if special lookups were introduced. For each index separate, 32-bit lookup is required, and then the code simplifies to:

const uint32_t w = uint32_t(lookup32_0[a])
                 | uint32_t(lookup32_1[b])
                 | uint32_t(lookup32_2[c])
                 | uint32_t(lookup32_3[d]);

outout[0..3] = w;

The obvious negative point of this solution is much larger lookups (1 kB) than the lookup used in the base version (64 bytes).

Comparison

Sample code is available at github.

The test program converts 32 MiB of data; this is repeated 10 times and the minimum time is printed.

Results from Core i5

#	extract indices from ...	output	tims [s]	speedup
1	... bytes (A)	save bytes (X)	0.121	100.0%	`██████████`
2		construct dword from 1 lookup table (Y)	0.146	82.9%	`████████`
3		construct dword from 4 lookup tables (Z)	0.124	97.6%	`█████████`
4	... 32-bit word (B)	save bytes (X)	0.095	127.4%	`█████████████`
5		construct dword from 1 lookup table (Y)	0.122	99.2%	`██████████`
6		construct dword from 4 lookup tables (Z)	0.099	122.2%	`████████████`
7	... 64-bit word (B)	save bytes (X)	0.070	172.8%	`█████████████████`

Summary:

The only factor causing significant speedup is minimizing memory reads, i.e. using variant B for input. Case 7 is the brightest example.
Fancy methods building a result at once do not help at all.
The biggest speedup is merely 25% and I guess the memory throughput limits further improvement.

Results from Core i6

#	extract indices from ...	output	tims [s]	speedup
1	... bytes (A)	save bytes (X)	0.082	100.0%	`██████████`
2		construct dword from 1 lookup table (Y)	0.082	100.0%	`██████████`
3		construct dword from 4 lookup tables (Z)	0.073	122.3%	`████████████`
4	... 32-bit word (B)	save bytes (X)	0.056	146.4%	`██████████████`
5		construct dword from 1 lookup table (Y)	0.064	128.1%	`█████████████`
6		construct dword from 4 lookup tables (Z)	0.050	164.0%	`████████████████`
7	... 64-bit word (B)	save bytes (X)	0.043	190.7%	`███████████████████`
8	... 64-bit word & bextr (B)	save bytes (X)	0.060	136.7%	`█████████████`
9	... 64-bit word & bextr (B)	construct 64-bit words from 1 lookup	0.072	113.9%	`███████████`