Author: | Wojciech Muła |
---|---|

Added on: | 2014-10-15 |

Updated on: | 2018-03-22 (added SSSE3 variant) |

Contents

SSE procedure is able to convert two 8-digit numbers. The main
instruction used in converting to numeric value is `PMADDWD`.

This instruction is specialized and I guess isn't often used.
`PMADDWD` operates on **packed words** and performs following
algorithm:

temp = array of 8 signed dwords -- 1. multiply 16-bit signed numbers vertically src = packed_word(...) dst = packed_word(...) for i in 0 .. 7 loop -- temp is 32-bit signed number temp[i] := signed_multiplication(src[i], dst[i]) end loop -- 2. add adjacent 32-bit words of temp and save result to src -- now src has type packed_dword for i in 0 .. 3 loop src[i] = temp[2 * i] + temp[2 * i + 1] end loop

Load 16 ASCII digits in order abcdefghijklmnop, and convert to bytes. Each byte a..p is in range 0..9:

t1 = packed_byte(| p | o | n | m | l | k | j | i | h | g | f | e | d | c | b | a |)) digit weight 10^0 . . . . . . . . . . . . . . . . . . . . . . . . . . . 10^15

Split even and odd digits:

t2_even = packed_byte(| 0 | o | 0 | m | 0 | k | 0 | i | 0 | g | 0 | e | 0 | c | 0 | a |)) t2_odd = packed_byte(| 0 | p | 0 | n | 0 | l | 0 | j | 0 | h | 0 | f | 0 | d | 0 | b |))

Calculate partial values of even digits:

t3_even = pmaddwd(t2_even, {1000, 10, 1000, 10, 1000, 10, 1000, 10}) = packed_dword(| 1000*m + 10*o | 1009*i + 10*k | 1000*e + 10*g | 1000*a + 10*c |)

as 4-digit decimals:

t3_even = packed_dword(| m0o0 | i0k0 | e0g0 | a0c0 |)

Similarly calculate partial values of odd digit:

t4_odd = pmaddwd(t2_odd, {100, 1, 100, 1, 100, 1, 100, 1}) = packed_dword(| 100*n + p | 100*j + l | 100*f + h | 100*b + d |)

as 4-digit decimals:

t4_odd = packed_dword(| 0n0p | 0j0l | 0f0h | 0b0d |)

Add partial results from step 3 and 4 (shown as decimals):

t5 = t3_even + t4_odd = = packed_dword(| m0o0 | i0k0 | e0g0 | a0c0 |) + packed_dword(| 0n0p | 0j0l | 0f0h | 0b0d |) = packed_dword(| mnop | ijkl | efgh | abcd |)

Last step is to form 8-digits partial results. Since numbers are less than 10000, higher words of

`t5`are zero:t5 = packed_dword(| 0000 | mnop | 0000 | ijkl | 0000 | efgh | 0000 | abcd |) t6 = t5 | (t5 >> 16) = packed_dword(| 0000 | mnop | mnop | ijkl | ijkl | efgh | efgh | abcd |) ^^^^^^^^^^^ ^^^^^^^^^^^ these dwords are processed later

Calculate results:

t7 = pmaddwd(t6, {10000, 1, 0, 0, 10000, 1, 0, 0}) = packed_dword(| 0 | ijklmnop | 0 | abcdefgh |)

64-bit multiplication is available in SSE starting from SSE4.1, so portable way to get final 64-bit value requires a CPU multiplication.

SSSE3 introduced instruction `PMADDUBSW` that works similarly to `PMADDWD`
but operates on **packed bytes**. The instruction treats numbers in one vector
as signed and, in another vector, as unsigned. This limits a little bit usage,
but fortunately not in this case.

Algorithm for SSSE3 can convert 16-digit numbers and works as follows:

Load 16 ASCII digits in order abcdefghijklmnop, and convert to bytes. Each byte a..p is in range 0..9:

t0 = packed_byte (| p | o | n | m | l | k | j | i | h | g | f | e | d | c | b | a |)) digit weight 10^0 . . . . . . . . . . . . . . . . . . . . . . . . . . . 10^15

Calculate partial values of adjacent digits using

`PMADDUBSW`. After this step each 16-bit element contains a 2-digit number:t1 = pmaddubsw(t0, {10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1}) = packed_word (| po | nm | lk | ji | hg | fe | dc | ba |)

Calculate partial values from adjacent 2-digit numbers using

`PMADDWD`. After this step each 32-bit element contains a 4-digit number. Please note that a 32-bit number can hold an 8-digit decimal number.:t2 = pmaddwd(t1, {100, 1, 100, 1, 100, 1, 100, 1}) = packed_dword(| ponm | lkji | hgfe | dcba |)

Convert into vector of words (16-bit words can hold 4-digit numbers):

t3 = packusdw(t2, t2) = packed_word (| ponm | lkji | hgfe | dcba | ponm | lkji | hgfe | dcba |)

Build 8-digit numbers using

`PMADDWD`:t4 = pmaddwd(t3, {10000, 1, 10000, 1}) = packed_dword(| ponmlkji | hgfedcba | ponmlkji | hgfedcba |)

Get two lowest dwords from

`t4`and convert into 16-digit number using scalar instructions.

In SSE4.1 conversion of **32-digit** numbers is very easy. String is split to
16-digit substrings and each substring is processed in steps 1 .. 5 of the
algorithm. After 5th step we have two values:

t5_lo = packed_dword(| 0000 | mnop | 0000 | ijkl | 0000 | efgh | 0000 | abcd |) t5_hi = packed_dword(| 0000 | MNOP | 0000 | IJKL | 0000 | EFGH | 0000 | ABCD |)

Steps 6 and 7 are different:

SSE4.1 supports

`PACKUSDW`, thanks to that`t5_lo`and`t5_hi`could be joined with single instruction:t6 = packusdw(t5_hi, t5_lo) = packed_word(| MNOP | IJKL | EFGH | ABCD | mnop | ijkl | efgh | abcd |)

Calculate results:

t7 = pmaddwd(t6, {10000, 1, 10000, 1, 10000, 1, 10000, 1}) = packed_dword(| IJKLMNOP | ABCDEFGH | ijklmnop | abcdefgh |)

SSE4.1 supports also

`PMULDQ`, so producing 64-bit values in SSE registers is possible:m1 = pmuldq(t7, {100000000, 100000000} = packed_qword(| ABCDEFGH00000000 | abcdefgh00000000 |) m2 = packed_dword(| 0 | t7[3] | 0 | t7[1] |) = packed_qword(| 0 | IJKLMNOP | 0 | ijklmnop |) t8 = m1 + m2 = packed_qword(| ABCDEFGHIJKLMNOP | abcdefgijklmnop |)

However, calculating 128-bit value require CPU instructions.

Sample code is available on gitub. Only SSE and SSSE3 implementations are available.

The SWAR methods are described in the article linked at the bottom.

method | time [us] | speedup |
---|---|---|

naive | 247514 | 1.00 |

SWAR (1) | 158043 | 1.57 |

SWAR (2) | 148712 | 1.66 |

SWAR (3) | 125978 | 1.96 |

SSE | 37704 | 6.56 |

SSSE3 | 27271 | 9.07 |

method | time [us] | speedup |
---|---|---|

naive | 131904 | 1.00 |

SWAR (1) | 84933 | 1.55 |

SWAR (2) | 84945 | 1.55 |

SWAR (3) | 75862 | 1.74 |

SSE | 18698 | 7.05 |

SSSE3 | 17693 | 7.45 |