Author: | Wojciech Muła |
---|---|
Added on: | 2014-10-15 |
Updated on: | 2018-03-22 (added SSSE3 variant) |
Contents
SSE procedure is able to convert two 8-digit numbers. The main instruction used in converting to numeric value is PMADDWD.
This instruction is specialized and I guess isn't often used. PMADDWD operates on packed words and performs following algorithm:
temp = array of 8 signed dwords -- 1. multiply 16-bit signed numbers vertically src = packed_word(...) dst = packed_word(...) for i in 0 .. 7 loop -- temp is 32-bit signed number temp[i] := signed_multiplication(src[i], dst[i]) end loop -- 2. add adjacent 32-bit words of temp and save result to src -- now src has type packed_dword for i in 0 .. 3 loop src[i] = temp[2 * i] + temp[2 * i + 1] end loop
Load 16 ASCII digits in order abcdefghijklmnop, and convert to bytes. Each byte a..p is in range 0..9:
t1 = packed_byte(| p | o | n | m | l | k | j | i | h | g | f | e | d | c | b | a |)) digit weight 10^0 . . . . . . . . . . . . . . . . . . . . . . . . . . . 10^15
Split even and odd digits:
t2_even = packed_byte(| 0 | o | 0 | m | 0 | k | 0 | i | 0 | g | 0 | e | 0 | c | 0 | a |)) t2_odd = packed_byte(| 0 | p | 0 | n | 0 | l | 0 | j | 0 | h | 0 | f | 0 | d | 0 | b |))
Calculate partial values of even digits:
t3_even = pmaddwd(t2_even, {1000, 10, 1000, 10, 1000, 10, 1000, 10}) = packed_dword(| 1000*m + 10*o | 1009*i + 10*k | 1000*e + 10*g | 1000*a + 10*c |)
as 4-digit decimals:
t3_even = packed_dword(| m0o0 | i0k0 | e0g0 | a0c0 |)
Similarly calculate partial values of odd digit:
t4_odd = pmaddwd(t2_odd, {100, 1, 100, 1, 100, 1, 100, 1}) = packed_dword(| 100*n + p | 100*j + l | 100*f + h | 100*b + d |)
as 4-digit decimals:
t4_odd = packed_dword(| 0n0p | 0j0l | 0f0h | 0b0d |)
Add partial results from step 3 and 4 (shown as decimals):
t5 = t3_even + t4_odd = = packed_dword(| m0o0 | i0k0 | e0g0 | a0c0 |) + packed_dword(| 0n0p | 0j0l | 0f0h | 0b0d |) = packed_dword(| mnop | ijkl | efgh | abcd |)
Last step is to form 8-digits partial results. Since numbers are less than 10000, higher words of t5 are zero:
t5 = packed_dword(| 0000 | mnop | 0000 | ijkl | 0000 | efgh | 0000 | abcd |) t6 = t5 | (t5 >> 16) = packed_dword(| 0000 | mnop | mnop | ijkl | ijkl | efgh | efgh | abcd |) ^^^^^^^^^^^ ^^^^^^^^^^^ these dwords are processed later
Calculate results:
t7 = pmaddwd(t6, {10000, 1, 0, 0, 10000, 1, 0, 0}) = packed_dword(| 0 | ijklmnop | 0 | abcdefgh |)
64-bit multiplication is available in SSE starting from SSE4.1, so portable way to get final 64-bit value requires a CPU multiplication.
SSSE3 introduced instruction PMADDUBSW that works similarly to PMADDWD but operates on packed bytes. The instruction treats numbers in one vector as signed and, in another vector, as unsigned. This limits a little bit usage, but fortunately not in this case.
Algorithm for SSSE3 can convert 16-digit numbers and works as follows:
Load 16 ASCII digits in order abcdefghijklmnop, and convert to bytes. Each byte a..p is in range 0..9:
t0 = packed_byte (| p | o | n | m | l | k | j | i | h | g | f | e | d | c | b | a |)) digit weight 10^0 . . . . . . . . . . . . . . . . . . . . . . . . . . . 10^15
Calculate partial values of adjacent digits using PMADDUBSW. After this step each 16-bit element contains a 2-digit number:
t1 = pmaddubsw(t0, {10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1}) = packed_word (| po | nm | lk | ji | hg | fe | dc | ba |)
Calculate partial values from adjacent 2-digit numbers using PMADDWD. After this step each 32-bit element contains a 4-digit number. Please note that a 32-bit number can hold an 8-digit decimal number.:
t2 = pmaddwd(t1, {100, 1, 100, 1, 100, 1, 100, 1}) = packed_dword(| ponm | lkji | hgfe | dcba |)
Convert into vector of words (16-bit words can hold 4-digit numbers):
t3 = packusdw(t2, t2) = packed_word (| ponm | lkji | hgfe | dcba | ponm | lkji | hgfe | dcba |)
Build 8-digit numbers using PMADDWD:
t4 = pmaddwd(t3, {10000, 1, 10000, 1}) = packed_dword(| ponmlkji | hgfedcba | ponmlkji | hgfedcba |)
Get two lowest dwords from t4 and convert into 16-digit number using scalar instructions.
In SSE4.1 conversion of 32-digit numbers is very easy. String is split to 16-digit substrings and each substring is processed in steps 1 .. 5 of the algorithm. After 5th step we have two values:
t5_lo = packed_dword(| 0000 | mnop | 0000 | ijkl | 0000 | efgh | 0000 | abcd |) t5_hi = packed_dword(| 0000 | MNOP | 0000 | IJKL | 0000 | EFGH | 0000 | ABCD |)
Steps 6 and 7 are different:
SSE4.1 supports PACKUSDW, thanks to that t5_lo and t5_hi could be joined with single instruction:
t6 = packusdw(t5_hi, t5_lo) = packed_word(| MNOP | IJKL | EFGH | ABCD | mnop | ijkl | efgh | abcd |)
Calculate results:
t7 = pmaddwd(t6, {10000, 1, 10000, 1, 10000, 1, 10000, 1}) = packed_dword(| IJKLMNOP | ABCDEFGH | ijklmnop | abcdefgh |)
SSE4.1 supports also PMULDQ, so producing 64-bit values in SSE registers is possible:
m1 = pmuldq(t7, {100000000, 100000000} = packed_qword(| ABCDEFGH00000000 | abcdefgh00000000 |) m2 = packed_dword(| 0 | t7[3] | 0 | t7[1] |) = packed_qword(| 0 | IJKLMNOP | 0 | ijklmnop |) t8 = m1 + m2 = packed_qword(| ABCDEFGHIJKLMNOP | abcdefgijklmnop |)
However, calculating 128-bit value require CPU instructions.
Sample code is available on gitub. Only SSE and SSSE3 implementations are available.
The SWAR methods are described in the article linked at the bottom.
method | time [us] | speedup |
---|---|---|
naive | 247514 | 1.00 |
SWAR (1) | 158043 | 1.57 |
SWAR (2) | 148712 | 1.66 |
SWAR (3) | 125978 | 1.96 |
SSE | 37704 | 6.56 |
SSSE3 | 27271 | 9.07 |
method | time [us] | speedup |
---|---|---|
naive | 131904 | 1.00 |
SWAR (1) | 84933 | 1.55 |
SWAR (2) | 84945 | 1.55 |
SWAR (3) | 75862 | 1.74 |
SSE | 18698 | 7.05 |
SSSE3 | 17693 | 7.45 |