Author: | Wojciech Muła |
---|---|
Added on: | 2015-12-27 |
Updates: | 2016-01-17 (results from Core i7, BMI2, 64-bit variant of version B (case #7)) |
Contents
In a basic step of the Base64 encoding three bytes are processed producing four output bytes. The step consist following stages:
In this text different implementations of the procedure are examined.
The naive way is to load 3 bytes and then combine them to 4 indices:
const uint8_t x0 = input[0]; const uint8_t x1 = input[1]; const uint8_t x2 = input[2]; const uint8_t a = x0 & 0x3f; const uint8_t b = ((x1 & 0xf) << 2) | (x0 >> 6); const uint8_t c = ((x2 & 0x3) << 4) | (x1 >> 4); const uint8_t d = x2 >> 2;
This solution requires following operations:
Although some of the instructions could be executed in parallel (this depends on a CPU) the overall speed of the code is not very big.
Possible modification is to load at once 4 bytes (we use just 3 of them) and then easily extract 6-bit indices:
const uint32_t x = input[0..3]; const uint8_t a = x & 0x3f; const uint8_t b = (x >> 8) & 0x3f; const uint8_t c = (x >> 16) & 0x3f; const uint8_t d = (x >> 24) & 0x3f;
The modified version requires:
A drawback is that 4-bytes chunk is loaded, it's possible to read off the available process space; this is the drawback of all SWAR/SIMD algorithms.
On 64-bit machines an input word could be 64-bit. Thanks to that the number of load decreases two times. However, only 6 bytes is used per iteration.
The common scheme in this optimization is right shift followed by masking 6 lowest bits. Bit Manipulation Instruction comes with an instruction bext doing exactly the described operation.
Again, the basic solution is sequentially translate and save separate bytes:
*out++ = lookup8[a]; *out++ = lookup8[b]; *out++ = lookup8[c]; *out++ = lookup8[d];
Memory loads are needed to properly translate the indices, thus 4 memory loads are unavoidable.
The only positive aspect is that size of lookup8 is only 64 bytes, it perfectly fits in the cache line.
This solution requires:
Modification of this step is similar to loading, i.e. we build a 32-bit result and store the dword using just one instruction:
const uint32_t w = uint32_t(lookup8[a]) | (uint32_t(lookup8[b]) << 8) | (uint32_t(lookup8[c]) << 16) | (uint32_t(lookup8[d]) << 24); output[0..3] = w;
This modification requires:
Bit shifts introduced in the above version could be avoided if special lookups were introduced. For each index separate, 32-bit lookup is required, and then the code simplifies to:
const uint32_t w = uint32_t(lookup32_0[a]) | uint32_t(lookup32_1[b]) | uint32_t(lookup32_2[c]) | uint32_t(lookup32_3[d]); outout[0..3] = w;
The obvious negative point of this solution is much larger lookups (1 kB) than the lookup used in the base version (64 bytes).
Sample code is available at github.
The test program converts 32 MiB of data; this is repeated 10 times and the minimum time is printed.
# | extract indices from ... | output | tims [s] | speedup | |
---|---|---|---|---|---|
1 | ... bytes (A) | save bytes (X) | 0.121 | 100.0% | ██████████ |
2 | construct dword from 1 lookup table (Y) | 0.146 | 82.9% | ████████ | |
3 | construct dword from 4 lookup tables (Z) | 0.124 | 97.6% | █████████ | |
4 | ... 32-bit word (B) | save bytes (X) | 0.095 | 127.4% | █████████████ |
5 | construct dword from 1 lookup table (Y) | 0.122 | 99.2% | ██████████ | |
6 | construct dword from 4 lookup tables (Z) | 0.099 | 122.2% | ████████████ | |
7 | ... 64-bit word (B) | save bytes (X) | 0.070 | 172.8% | █████████████████ |
Summary:
# | extract indices from ... | output | tims [s] | speedup | |
---|---|---|---|---|---|
1 | ... bytes (A) | save bytes (X) | 0.082 | 100.0% | ██████████ |
2 | construct dword from 1 lookup table (Y) | 0.082 | 100.0% | ██████████ | |
3 | construct dword from 4 lookup tables (Z) | 0.073 | 122.3% | ████████████ | |
4 | ... 32-bit word (B) | save bytes (X) | 0.056 | 146.4% | ██████████████ |
5 | construct dword from 1 lookup table (Y) | 0.064 | 128.1% | █████████████ | |
6 | construct dword from 4 lookup tables (Z) | 0.050 | 164.0% | ████████████████ | |
7 | ... 64-bit word (B) | save bytes (X) | 0.043 | 190.7% | ███████████████████ |
8 | ... 64-bit word & bextr (B) | save bytes (X) | 0.060 | 136.7% | █████████████ |
9 | ... 64-bit word & bextr (B) | construct 64-bit words from 1 lookup | 0.072 | 113.9% | ███████████ |
Summary: