Author: | Wojciech Muła |
---|---|
Added on: | 2017-01-07 |
Updated on: | 2017-02-13 (correct the author), 2017-02-03 (link to 64-bit code), 2017-01-08 (spelling) |
Base64 algorithms were subject of my interest last year and I proposed a few different SIMD approaches for both encoding and encoding. This text sums up my experiments with ARM Neon implementation. For algorithms' details please refer to Base64 encoding with SIMD instructions and Base64 decoding with SIMD instructions.
ARM Neon is a pretty rich instruction set, it supports common SIMD logic and arithmetic operations, as well as vector lookups. The base register size is 64 bit (called doubleword), but there is support for 128-bit operations (quadword). Such a wide register is composed from two 64-bit registers, thus there is no cost of getting lower or higher part of quadword register; however at intrinsics level we need to use pseudo-functions vget_{low,high} or vcompose (to construct a new vector). All instructions accept doubleword registers, most work on quadwords.
Neon has one unique trait when compared with Intel's SIMD: it can load/store arrays of registers (up to four). Moreover it's possible to treat data as arrays of structures and interleave elements accordingly. In base64 we process three-byte blocks, and thanks to the interleaving we can easily load 8 (or 16) three-byte blocks, placing 0th bytes in the first register, 1st bytes in the second register and 2nd bytes in the third register.
Here is a sample code using intrinsics functions. The uint8x16x3_t type represents a sequence of three quadwords registers; no assembly instructions will be generated due to referencing sub-items of value having this (or similar) type.
// input bytes (3 * 16): [a0|b0|c0|a1|b1|c1|...|a15|b15|c15] const uint8x16x3_t in = vld3q_u8(input + i); // content of registers: // in.val[0] = [a0|a1|...|a15] // in.val[1] = [b0|b1|...|b15] // in.val[2] = [c0|c1|...|c15]
Basically, interleaved load and store are the most important, vital part of Neon implementations.
Base64 decoding requires input validation, i.e. a procedure which translates from ASCII into binary values must detect invalid ASCII letters. Intel SIMD has fast ways to communicate between the vector unit and the scalar unit, that's responsible also for jumps. Unfortunately, sending values from the Neon unit to the scalar unit is very slow. I overcome this limitation by using temporary memory buffer, which is independently read by the scalar code.
Neon supports a bit select instruction (x ? y : z) that appeared to be useful in a few places. (AMD XOP also provides such instruction, AVX512F comes with powerful ternary logic instruction.)
Neon supports also shifts at byte level that makes some bit operations easier.
Encoding is done in four steps:
The first step is accomplished with described earlier interleaved loading, we load 16 three-byte fields and after this step the first register has got bytes [aaaaaabb], the second one [bbbbcccc] and the third one [ccdddddd]. Using standard bitwise and, bitwise or and shifts we build four vectors having bytes [00aaaaaa], [00bbbbbb], [00cccccc] and [00dddddd].
Then we translate these values into ASCII codes. I literally rewritten three procedures from Intel SSE:
The last step, saving the result, i.e. four vectors, also uses interleaved transfer.
The whole encoding template.
template <typename LOOKUP_FN> void encode(LOOKUP_FN lookup, const uint8_t* input, size_t bytes, uint8_t* output) { uint8_t* out = output; for (size_t i = 0; i < bytes; i += 16*3) { // load 3 x 16 bytes, each three-byte block has following layout: // // [ccdddddd|bbbbcccc|aaaaaabb] // byte 2 byte 1 byte 0 const uint8x16x3_t in = vld3q_u8(input + i); // isolate fields into const uint8x16_t field_a = vshrq_n_u8(in.val[0], 2); const uint8x16_t field_b = vandq_u8( vorrq_u8(vshrq_n_u8(in.val[1], 4), vshlq_n_u8(in.val[0], 4)), packed_byte(0x3f)); const uint8x16_t field_c = vandq_u8( vorrq_u8(vshrq_n_u8(in.val[2], 6), vshlq_n_u8(in.val[1], 2)), packed_byte(0x3f)); const uint8x16_t field_d = vandq_u8(in.val[2], packed_byte(0x3f)); // translate fields uint8x16x4_t result; result.val[0] = lookup(field_a); result.val[1] = lookup(field_b); result.val[2] = lookup(field_c); result.val[3] = lookup(field_d); // save 4 x 16 bytes with interleave vst4q_u8(out, result); out += 64; } }
Base64 decoding requires four steps:
The first step uses — no surprise — interleaved loading; we load 16 four-characters chunks. Then each character is translated into 6-bit value, i.e. each of four register has bytes [00aaaaaa], [00bbbbbb], [00cccccc] and [00dddddd]. I rewritten two SSE procedures:
Now we compose three registers that will have bytes [aaaaaabb], [bbbbcccc] and [ccdddddd]. This is done with shifts and bitwise ors. The final step is to save these bytes, and again, we use interleaved store.
The decoding template.
template <typename FN_LOOKUP> void decode(FN_LOOKUP lookup, const uint8_t* input, size_t size, uint8_t* output) { uint8_t* out = output; uint8x8_t field_a; uint8x8_t field_b; uint8x8_t field_c; uint8x8_t field_d; uint8x8_t error_a; uint8x8_t error_b; uint8x8_t error_c; uint8x8_t error_d; union { uint8_t error_mem[8]; uint32_t error_word[2]; }; uint8x8x3_t result; for (size_t i=0; i < size; i += 8*4) { const uint8x8x4_t in = vld4_u8(input + i); field_a = lookup(in.val[0], error_a); field_b = lookup(in.val[1], error_b); field_c = lookup(in.val[2], error_c); field_d = lookup(in.val[3], error_d); const uint8x8_t error = vorr_u8(vorr_u8(error_a, error_b), vorr_u8(error_c, error_d)); vst1_u8(error_mem, error); if (error_word[0] | error_word[1]) { // report an error } result.val[0] = vorr_u8(vshr_n_u8(field_b, 4), vshl_n_u8(field_a, 2)); result.val[1] = vorr_u8(vshr_n_u8(field_c, 2), vshl_n_u8(field_b, 4)); result.val[2] = vorr_u8(field_d, vshl_n_u8(field_c, 6)); vst3_u8(out, result); out += 8*3; } }
As it was said, communication between the Neon unit and main unit is expansive. In the result error reporting path uses a temporary memory to save information.
I purchased Raspberry Pi 3 with Cortex-ARMv8, however, the operating system doesn't support the 64-bit mode yet. Everything was tested in the 32-bit mode.
My first implementations used only doubleword instructions (that performed pretty well), later I used quadword in both encoding and decoding. Performance of encoding become better, but decoding was almost 40% slower. I suppose that I omitted something due to lack of experience. Any hints are warmly welcomed.
CPU: ARMv7 Processor rev 4 (v7l)
Compiler: GCC 4.9.2
procedure | time [s] | speedup | |
---|---|---|---|
scalar (32 bit) | 1.06887 | 1.00 | ██████████ |
ARM NEON (naive lookup) | 0.43296 | 2.47 | ████████████████████████▋ |
ARM NEON (optimized lookup) | 0.38450 | 2.78 | ███████████████████████████▊ |
ARM NEON (pshufb improved lookup) | 0.32139 | 3.33 | █████████████████████████████████▎ |
procedure | lookup | time [s] | speedup | |
---|---|---|---|---|
improved scalar | N/A | 0.24247 | 1.00 | ██████████ |
scalar | N/A | 0.49157 | 0.49 | ████▉ |
ARM NEON | byte blend | 0.14298 | 1.70 | ████████████████▉ |
ARM NEON | pshufb bitmask | 0.10574 | 2.29 | ██████████████████████▉ |
Github repository contains all procedures and sample programs.