Author: | Wojciech Muła |
---|---|
Added on: | 2014-10-02 |
Updated: | 2014-10-07 (core i7 results); 2014-10-05 (lookup table methods) |
Contents
Conversion to octal isn't very computer-friendly, each digit occupy 3 bits, it isn't power of two. The smallest number of bits to convert worth to consider is 12 (3 nibbles and 4 digits):
+--------------+ |dddc ccbb baaa| +--------------+ 8..11 4..7 0..3
Then output is a 32-bit word:
+--------+--------+--------+--------+ |00000ddd|00000ccc|00000bbb|00000aaa| +--------+--------+--------+--------+
Conversion to ASCII require single add of 0x30303030 (0x30 = ord('0')).
Naive conversion consists three basic steps:
Sample implementation:
#define packed32(x) ((uint8_t)(x) * (uint32_t)(0x01010101)) uint32_t expand_12bits_naive(uint32_t x) { const uint32_t a = (x & 0x007) << 24; const uint32_t b = (x & 0x038) << (16 - 3); const uint32_t c = (x & 0x1c0) << (8 - 6); const uint32_t d = (x & 0xe00) >> 9; return a | b | c | d; } uint32_t to_oct_naive(uint32_t x) { return expand_12bits_naive(x) + packed32('0'); }
I think using lookup tables aren't good solution. First: shifting on current CPUs is very fast, second: LUT requires memory and pollutes cache. The results from my old Pentium M shows that LUT are the fastest, but I bet that PDEP or SSE2 version win on Core i7 .
Two implementations of LUT were tested:
Variant 1 sample code:
uint32_t lookup[1 << 12]; void prepare_single_table() { for (uint32_t x=0; x <= 0xfff; x++) { lookup[x] = to_oct_naive(x); } } uint32_t to_oct_single_table(uint32_t x) { return lookup[x]; // assert x <= 0xfff; }
Variant 2 sample code:
uint32_t lookup_nibbles01[256]; uint32_t lookup_nibble2[16]; void prepare_single_table() { for (uint32_t i=0; i <= 256; i++) { lookup_nibbles01[i] = expand_12bits_naive(i); } for (uint32_t i=0; i <= 16; i++) { lookup_nibble2[i] = expand_12bits_naive(i << 8); } } uint32_t to_oct_two_tables(uint32_t x) { const uint32_t expanded = lookup_nibbles01[x & 0xff] | lookup_nibble2[(x >> 8) & 0xf]; return expanded + packed32('0'); }
Variant 2 requires less memory than variant 1 (only 6%), but performs 4 operations to calculate result (2 bit-and, 1 shift and add) and issues 2 memory fetches.
Using single multiplication we can perform two shifts and one bit-or. Input word have to be split to two subwords: containing digits a, c and b, d.
For example digits a, c are converted in following steps:
isolate digits (bit-and):
ac = x & 0x1c7
+--------+--------+--------+--------+ |00000000|00000000|0000000c|cc000aaa| +--------+--------+--------+--------+
the digit a is on the correct position, but c have to be shifted 15 bits to the left. This could be saved as:
expanded_ac = (ac << 15) + (ac << 0)
Or using multiplication:
expanded_ac = ac * ((1 << 15) + 1)
Layout of expanded_ac:
+--------+--------+--------+--------+ |00000000|00000ccc|000aaa0c|cc000aaa| +--------+--------+--------+--------+
Downside of this method is that digits are generated in reversed order, so the instruction bswap have to be used on the result.
Sample implementation:
uint32_t expand_12bits_mul(uint32_t x) { uint32_t ac = x & 0x1c7; // ac = 000ccc000aaa uint32_t bd = x & 0xe38; // bd = ddd000bbb000 uint32_t expanded_ac = (ac * ((1 << 0) + (1 << 10))) & 0x00070007; uint32_t expanded_bd = (bd * ((1 << 5) + (1 << 15))) & 0x07000700; return expanded_ac | expanded_bd; } uint32_t to_oct_mul(uint32_t x) { return bswap(expand_12bits_mul(x) + packed32('0')); }
PDEP is a new instruction introduced in BMI2 extension. Please take a look at my previous article where this instructions is discussed in details. PDEP makes conversion easy and very fast — code have just three instructions (pdep, add, bswap).
Sample implementation:
uint32_t to_oct_pdep(uint32_t x) { const uint32_t result = pdep(x, 0x07070707); return bswap(result + packed32('0')); }
SSE2 does in parallel the same bit-and & shifts operations as the naive version.
Variable shifts are possible with multiplication, instruction PMULHUW multiplies 16-bit words and store higher word of a result. When multiplier is in form (1 << n) then result of this instruction is x >> (16 - n).
Unfortunately a multiplier is always greater than 1, because of that input arguments have to be modified (in step 2) in order to get the correct results.
The SSE2 procedure contains following steps:
+----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+ |....dddcccbbbaaa|....dddcccbbbaaa|....dddcccbbbaaa|....dddcccbbbaaa|....ggghhhfffeee|....ggghhhfffeee|....ggghhhfffeee|....ggghhhfffeee| +----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+
+----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+ |...dddcccbbbaaa.|...dddcccbbbaaa.|...dddcccbbbaaa.|...dddcccbbbaaa.|...ggghhhfffeee.|...ggghhhfffeee.|...ggghhhfffeee.|...ggghhhfffeee.| +----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+
+----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+ |...ddd..........|......ccc.......|.........bbb....|............aaa.|...ggg..........|......hhh.......|.........fff....|............eee.| +----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+
Parallel shift right (PMULHUW):
Multipliers are (respectively):
+----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+ |.............ddd|.............ccc|.............bbb|.............aaa|.............ggg|.............hhh|.............fff|.............eee| +----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+
convert | time [s] | speedup | |
---|---|---|---|
naive | 0.2296 | 1.00 | ██████████████████ |
mul | 0.3846 | 0.60 | ███████████ |
SSE2 | 0.2629 | 0.87 | ████████████████ |
1 LUT | 0.1122 | 2.05 | ██████████████████████████████████████ |
2 LUTs | 0.1286 | 1.79 | █████████████████████████████████ |
Notes:
Statistics from 20 samples.
convert | time [s] | speedup (avg) | |||
---|---|---|---|---|---|
min | max | avg | |||
naive | 0.1630 | 0.2040 | 0.1841 | 1.00 | ████████████████████ |
mul | 0.1630 | 0.2200 | 0.1777 | 1.04 | ████████████████████ |
SSE2 | 0.2170 | 0.2570 | 0.2318 | 0.79 | ███████████████ |
pdep | 0.1360 | 0.1650 | 0.1491 | 1.23 | ████████████████████████ |
1 LUT | 0.1350 | 0.1680 | 0.1452 | 1.27 | █████████████████████████ |
2 LUTs | 0.1350 | 0.1770 | 0.1468 | 1.25 | █████████████████████████ |
Notes:
Full implementation and program used in experiments are available on github.