Author: | Wojciech Muła |
---|---|
Added on: | 2014-10-12 |
Contents
Naive parsing of unsigned decimal numbers can be coded as:
uint32_t parse(char* s) { uint32_t result = 0; while (*s) { const uint8_t digit = *s++; result = result * 10 + (digit - '0'); } return result; }
The procedure parse does not check validity of a string nor check for overflow — these problems won't be discussed in this text.
Processing single letter require just 3 operations:
On x86 multiplication by 10 is cheap; multiplication x * 10 is equivalent to x << 3 + x << 1. Shift by 8 is coded with LEA (it's executed by addressing unit, not ALU), and shift by 1 is simple addition or another LEA.
In SWAR version we'll process 4 letters at once. For shorter strings naive fallback have to be used or proper mask applied on dword.
The common phase of all SWAR versions is converting ASCII to digit values, this is simple subtraction.
input — 4 ASCII letters:
x = packed_byte(| '1' | '2' | '7' | '0' |)
subtract vector of '0':
x' = packed_byte(| '1' | '2' | '7' | '0' |) - packed_byte(| '0' | '0' | '0' | '0' |) = packed_byte(| 1 | 2 | 7 | 0 |)
Suppose we have just two digits x and y stored in separate bytes:
w = packed_byte(| 0 | 0 | x | y |)
Value of w is 256*x + y, but we want to get 10*y + x. By multiplication w by 2561 we get this value in the higher byte. Magic? No, 2561 = 256 * 10 + 1, and:
Proof: (256 ⋅ x + y) ⋅ (1 + 10256) = (256 ⋅ x + y) + (655360 ⋅ x + 256 ⋅ 10 ⋅ y). Now reject components less than 256 and greater than 65535: (x ⋅ 256) + (y ⋅ 2560) = 256 ⋅ (10 ⋅ y + x).
Full expression:
v = ((w * 2561) >> 8) & 0xff
The same trick could be used to join 2-digit values ab & cd stored on 1st and 3rd byte:
Z = packed_byte(| 0 | ab | 0 | cd |)
The "magic" multiplier is 6553601, result is 100*cd + ab saved in the higher word.
input — 4 digits:
t1 = packed_byte(| d | c | b | a |)
form two words suitable to make 2-digits values:
t2a = packed_byte(| 0 | 0 | b | a |) t2b = packed_byte(| 0 | 0 | d | c |)
form 2-digits values by multiplication t2{a,b} by 2561:
t3a = packed_byte(| ? | ? | 10*a + b | ? |) t3b = packed_byte(| ? | ? | 10*c + d | ? |)
prepare dword from t3{a,b} to get 4-digits value:
t4 = packed_byte(| 0 | 10*a + b | 0 | 10*c + d |)
form 4-digit value by multiplication t3 by 6553601:
t5 = packed_word(| 1000*c + 100* d + 10*a + b | ? |)
extract higher word from t5.
#define packed32(b) (uint8_t(b) * 0x01010101) uint32_t parse1(char* s) { const uint32_t input = *reinterpret_cast<uint32_t*>(s); const uint32_t t1 = input - packed32('0'); const uint32_t t2a = t & 0xffff; const uint32_t t2b = t >> 16; const uint32_t t3a = ((t2a * (2561)) >> 8) & 0xff; const uint32_t t3b = ((t2b * (2561)) >> 8) & 0xff; const uint32_t t4 = t3a | (t3b << 16); return (word * (6553601) >> 16); }
Another approach to making 2-digit values:
input — 4 digits:
t1 = | d | c | b | a |
mask 1st and 3rd digit:
t2 = | d | 0 | b | 0 |
move one digit left and multiply by 10:
t3 = | 0 | d*10 | 0 | b*10 |
This can be done without multiplication. We calculate (x * 10) >> 8, multiplication is replaced by shifts ((x << 3) + (x << 1)) >> 8, after simplify shifts: (x >> (8-3)) + (x >> (8-1)).
add t1 and t3:
t4 = | d | c + d*10 | b | a + b*10 |
mask 1st & 3rd digits again:
t5 = | 0 | c + d*10 | 0 | a + b*10 |
convert t5 to 4-digits value.
uint32_t parse2(char* s) { const uint32_t input = bswap(*reinterpret_cast<uint32_t*>(s)); const uint32_t t1 = input - packed32('0'); const uint32_t t2 = t1 & 0xff00ff00; const uint32_t t3 = (t2 >> (8-1)) + (t2 >> (8-3)); const uint32_t t4 = t1 + t3; const uint32_t t5 = t4 & 0x00ff00ff; // reversed order, so multiplier is diffrent return (t5 * (100 + 65536)) >> 16; }
Another approach to making 2-digit values:
input — 4 digits:
t1 = | d | c | b | a |
multiply by 10:
t2 = | d*10 | c*10 | b*10 | a*10 |
move digits one position right (shift dword right by 8 bits):
t3 = | 0 | d*10 | c*10 | b*10 |
add t2 and t3:
t4 = | d*10 | c + d*10 | b + c*10 | a + b*10 |
mask unneeded bytes 1 and 3:
t5 = | 0 | c + d*10 | 0 | a + b*10 |
Now word t5 could be converted as in the version 2.
uint32_t parse3(char* s) { const uint32_t input = bswap(*reinterpret_cast<uint32_t*>(s)); const uint32_t t1 = input - packed32('0'); const uint32_t t2 = (t1 * 10) >> 8; const uint32_t t3 = t1 + t2; const uint32_t t4 = t3 & 0x00ff00ff; return (t4 * (100 + 65536)) >> 16; }
Number of operations required to convert 4 ASCII digits.
operation | naive | version 1 | version 2 | version 3 |
---|---|---|---|---|
bswap | 1 | 1 | ||
bit-or | 1 | |||
bit-and | 3 | 2 | 1 | |
shift | 5 | 3 | 2 | |
add | 4 | 2 | 1 | |
sub | 1 | 1 | 1 | 1 |
mult | 3 | 1 | 1 | |
mult-by-10 | 4 | 1 | ||
total | 9 | 13 | 10 | 6 |
Sample code is available on gitub.