ARM Neon and Base64 encoding & decoding

Author: Wojciech Muła
Added: 2017-01-07; updates: 2017-02-13 (correct the author), 2017-02-03 (link to 64-bit code), 2017-01-08 (spelling)

Introduction

Base64 algorithms were the subject of my interest last year, and I proposed a few different SIMD approaches for both encoding and decoding. This text sums up my experiments with an ARM Neon implementation. For the algorithms' details please refer to Base64 encoding with SIMD instructions and Base64 decoding with SIMD instructions.

ARM Neon

ARM Neon is a pretty rich instruction set; it supports common SIMD logic and arithmetic operations, as well as vector lookups. The base register size is 64 bits (called a doubleword), but there is also support for 128-bit operations (quadwords). Such a wide register is composed of two 64-bit registers, thus there is no cost to getting the lower or higher part of a quadword register; however, at the intrinsics level we need to use the pseudo-functions vget_{low,high} or vcombine (to construct a new vector). All instructions accept doubleword registers, and most also work on quadwords.

Neon has one unique trait compared with Intel's SIMD: it can load/store arrays of registers (up to four). Moreover, it's possible to treat the data as an array of structures and interleave elements accordingly. In base64 we process three-byte blocks, and thanks to the interleaving we can easily load 8 (or 16) three-byte blocks, placing the 0th bytes in the first register, the 1st bytes in the second register and the 2nd bytes in the third register.

Here is a sample using intrinsic functions. The uint8x16x3_t type represents a sequence of three quadword registers; referencing sub-items of a value of this (or a similar) type generates no extra assembly instructions.

```
// input bytes (3 * 16): [a0|b0|c0|a1|b1|c1|...|a15|b15|c15]
const uint8x16x3_t in = vld3q_u8(input + i);

// content of registers:
// in.val[0] = [a0|a1|...|a15]
// in.val[1] = [b0|b1|...|b15]
// in.val[2] = [c0|c1|...|c15]
```

Basically, interleaved loads and stores are the vital part of the Neon implementations.

Conditions

Base64 decoding requires input validation, i.e. the procedure which translates ASCII into binary values must detect invalid ASCII characters. Intel SIMD has fast ways to communicate between the vector unit and the scalar unit, which is also responsible for branching. Unfortunately, sending values from the Neon unit to the scalar unit is very slow. I overcame this limitation by using a temporary memory buffer, which is independently read by the scalar code.

Bit operations

Neon supports a bit select instruction (x ? y : z), which proved useful in a few places. (AMD XOP also provides such an instruction; AVX-512F comes with a powerful ternary logic instruction.)

Neon also supports byte-level shifts, which make some bit operations easier.
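At the scalar level the bit select operation can be sketched as follows; this is only a per-byte model of the idea (the name bit_select is mine, it is not an intrinsic), picking each bit from the first operand where the mask is set and from the second operand elsewhere:

```cpp
#include <cstdint>

// Scalar model of a bit select: for each bit position, take the bit
// from `a` where `mask` is 1, otherwise take it from `b`.
inline uint8_t bit_select(uint8_t mask, uint8_t a, uint8_t b) {
    return uint8_t((a & mask) | (b & ~mask));
}
```

The Neon instruction does exactly this, but over whole 64- or 128-bit registers at once.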

Base64 encoding

Encoding is done in four steps:

• load three-byte blocks having bit layout [ccdddddd|bbbbcccc|aaaaaabb];
• isolate four 6-bit values, the fields a, b, c and d;
• translate them into ASCII codes;
• save the text.

The first step is accomplished with the interleaved loading described earlier: we load 16 three-byte blocks, and after this step the first register holds the bytes [aaaaaabb], the second [bbbbcccc] and the third [ccdddddd]. Using standard bitwise and, bitwise or and shifts we build four vectors with bytes [00aaaaaa], [00bbbbbb], [00cccccc] and [00dddddd].
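The same bit manipulations can be sketched in scalar code. The helper below is hypothetical (it is not part of the Neon procedure); it splits a single three-byte block into the four 6-bit fields using the same shift/and/or sequence that the vector code applies to 16 blocks at once:

```cpp
#include <cstdint>

// The four 6-bit fields of one three-byte block.
struct Fields { uint8_t a, b, c, d; };

// Split [aaaaaabb|bbbbcccc|ccdddddd] into 00aaaaaa, 00bbbbbb,
// 00cccccc and 00dddddd.
inline Fields split_fields(uint8_t byte0, uint8_t byte1, uint8_t byte2) {
    Fields f;
    f.a = byte0 >> 2;                           // 00aaaaaa
    f.b = ((byte0 << 4) | (byte1 >> 4)) & 0x3f; // 00bbbbbb
    f.c = ((byte1 << 2) | (byte2 >> 6)) & 0x3f; // 00cccccc
    f.d = byte2 & 0x3f;                         // 00dddddd
    return f;
}
```

For example, the input "Man" (0x4d, 0x61, 0x6e) yields the indices 19, 22, 5, 46, which encode to "TWFu".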

Then we translate these values into ASCII codes. I rewrote three procedures directly from the Intel SSE implementation:

• lookup_naive — the most naive, comparison-based approach;
• lookup_version2 — also uses comparisons, but is a bit smarter;
• lookup_pshufb_improved — uses the vector lookup instruction vtbl (I kept the name pshufb, though). The only important note is that vtbl doesn't support quadword registers, so two invocations of the instruction are needed.
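The idea behind the comparison-based lookups is easiest to see in scalar form: each 6-bit value becomes an ASCII code by adding a range-dependent offset. This is only a sketch of the principle, not the vectorized procedure:

```cpp
#include <cstdint>

// Translate a 6-bit value into its base64 ASCII code by adding an
// offset that depends on which of the four ranges the value falls in.
inline uint8_t encode_letter(uint8_t v) {
    if (v < 26) return 'A' + v;        // 0..25  -> 'A'..'Z'
    if (v < 52) return 'a' + (v - 26); // 26..51 -> 'a'..'z'
    if (v < 62) return '0' + (v - 52); // 52..61 -> '0'..'9'
    return (v == 62) ? '+' : '/';      // 62 -> '+', 63 -> '/'
}
```

In the vector versions the range tests become byte comparisons and the offsets are merged with bit selects or a vector lookup.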

The last step, saving the result (four vectors), also uses an interleaved transfer.

The whole encoding template:

```
template <typename LOOKUP_FN>
void encode(LOOKUP_FN lookup, const uint8_t* input, size_t bytes, uint8_t* output) {

    uint8_t* out = output;

    for (size_t i = 0; i < bytes; i += 16*3) {
        // load 3 x 16 bytes, each three-byte block has the following layout:
        //
        //      [ccdddddd|bbbbcccc|aaaaaabb]
        //        byte 2   byte 1   byte 0
        const uint8x16x3_t in = vld3q_u8(input + i);

        // isolate the 6-bit fields
        const uint8x16_t field_a = vshrq_n_u8(in.val[0], 2);
        const uint8x16_t field_b = vandq_u8(
                vorrq_u8(vshrq_n_u8(in.val[1], 4), vshlq_n_u8(in.val[0], 4)),
                packed_byte(0x3f));
        const uint8x16_t field_c = vandq_u8(
                vorrq_u8(vshrq_n_u8(in.val[2], 6), vshlq_n_u8(in.val[1], 2)),
                packed_byte(0x3f));
        const uint8x16_t field_d = vandq_u8(in.val[2], packed_byte(0x3f));

        // translate the fields into ASCII
        uint8x16x4_t result;
        result.val[0] = lookup(field_a);
        result.val[1] = lookup(field_b);
        result.val[2] = lookup(field_c);
        result.val[3] = lookup(field_d);

        // save 4 x 16 bytes with interleave
        vst4q_u8(out, result);
        out += 64;
    }
}
```

Base64 decoding

Base64 decoding requires four steps:

• load four-character chunks with interleaving;
• translate the characters into 6-bit values;
• pack four 6-bit values into 24-bit words;
• save the 24-bit words.

The first step uses — no surprise — interleaved loading; we load four-character chunks, eight per iteration in the doubleword variant below. Then each character is translated into a 6-bit value, i.e. the four registers hold bytes [00aaaaaa], [00bbbbbb], [00cccccc] and [00dddddd]. I rewrote two SSE procedures:

• lookup_byte_blend — a naive, comparison-based algorithm; it uses the bit select instruction to blend vectors;
• lookup_pshufb_bitmask — a very compact approach that heavily uses the vector lookup instruction.
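A scalar sketch of the translation with validation; the vectorized lookups compute the same mapping, producing a non-zero error byte for any character outside the base64 alphabet (the function name here is mine, not from the implementation):

```cpp
#include <cstdint>

// Map a base64 character back to its 6-bit value; on an invalid
// character set *error to a non-zero value, mirroring the per-byte
// error flags produced by the vector lookups.
inline uint8_t decode_letter(uint8_t c, uint8_t* error) {
    *error = 0;
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    *error = 0xff;
    return 0;
}
```

Keeping the error as a vector of flags, rather than branching per byte, is what lets the vector code postpone the expensive Neon-to-scalar transfer.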

Now we compose three registers holding the bytes [aaaaaabb], [bbbbcccc] and [ccdddddd]. This is done with shifts and bitwise ors. The final step is to save these bytes and, again, we use an interleaved store.
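In scalar terms the packing step corresponds to the following hypothetical helper, mirroring the shift-and-or combination used in the vector code:

```cpp
#include <cstdint>

// Three output bytes of one decoded block.
struct Bytes3 { uint8_t b0, b1, b2; };

// Pack the 6-bit fields 00aaaaaa, 00bbbbbb, 00cccccc, 00dddddd back
// into the bytes [aaaaaabb], [bbbbcccc], [ccdddddd].
inline Bytes3 pack_fields(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    Bytes3 r;
    r.b0 = uint8_t((a << 2) | (b >> 4));
    r.b1 = uint8_t((b << 4) | (c >> 2));
    r.b2 = uint8_t((c << 6) | d);
    return r;
}
```

Packing the fields 19, 22, 5, 46 (the indices of "TWFu") reproduces the bytes 0x4d, 0x61, 0x6e, i.e. the string "Man".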

The decoding template:

```
template <typename FN_LOOKUP>
void decode(FN_LOOKUP lookup, const uint8_t* input, size_t size, uint8_t* output) {

    uint8_t* out = output;

    uint8x8_t field_a;
    uint8x8_t field_b;
    uint8x8_t field_c;
    uint8x8_t field_d;

    uint8x8_t error_a;
    uint8x8_t error_b;
    uint8x8_t error_c;
    uint8x8_t error_d;

    union {
        uint8_t  error_mem[8];
        uint32_t error_word[2];
    };

    uint8x8x3_t result;

    for (size_t i = 0; i < size; i += 8*4) {

        const uint8x8x4_t in = vld4_u8(input + i);

        field_a = lookup(in.val[0], error_a);
        field_b = lookup(in.val[1], error_b);
        field_c = lookup(in.val[2], error_c);
        field_d = lookup(in.val[3], error_d);

        const uint8x8_t error = vorr_u8(vorr_u8(error_a, error_b), vorr_u8(error_c, error_d));

        vst1_u8(error_mem, error);
        if (error_word[0] | error_word[1]) {
            // report an error
        }

        result.val[0] = vorr_u8(vshr_n_u8(field_b, 4), vshl_n_u8(field_a, 2));
        result.val[1] = vorr_u8(vshr_n_u8(field_c, 2), vshl_n_u8(field_b, 4));
        result.val[2] = vorr_u8(field_d, vshl_n_u8(field_c, 6));

        vst3_u8(out, result);
        out += 8*3;
    }
}
```

As noted earlier, communication between the Neon unit and the main unit is expensive. As a result, the error reporting path uses a temporary memory buffer to pass the information.

Performance results

I purchased a Raspberry Pi 3 with an ARMv8 Cortex CPU; however, the operating system doesn't support the 64-bit mode yet, so everything was tested in the 32-bit mode.

My first implementations used only doubleword instructions (and performed pretty well); later I used quadwords in both encoding and decoding. The performance of encoding became better, but decoding was almost 40% slower. I suppose I omitted something due to lack of experience; any hints are warmly welcome.

CPU: ARMv7 Processor rev 4 (v7l)

Compiler: GCC 4.9.2

Encoding

procedure                          time [s]  speedup
scalar (32 bit)                    1.06887      1.00 ██████████
ARM NEON (naive lookup)            0.43296      2.47 ████████████████████████▋
ARM NEON (optimized lookup)        0.38450      2.78 ███████████████████████████▊
ARM NEON (pshufb improved lookup)  0.32139      3.33 █████████████████████████████████▎

Decoding (doublewords)

procedure        lookup          time [s]  speedup
improved scalar  N/A             0.24247      1.00 ██████████
scalar           N/A             0.49157      0.49 ████▉
ARM NEON         byte blend      0.14298      1.70 ████████████████▉
ARM NEON         pshufb bitmask  0.10574      2.29 ██████████████████████▉