Parsing decimal numbers — part 2: SSE

Author:	Wojciech Muła
Added on:	2014-10-15
Updated on:	2018-03-22 (added SSSE3 variant)

Contents

PMADDWD details
Algorithm
SSSE3
SSE4.1
Sample code
Experiments results
- Core i5 M540 @ 2.53GHz
- Skylake Core i7-6700 CPU @ 3.40GHz
See also

SSE procedure is able to convert two 8-digit numbers. The main instruction used in converting to numeric value is PMADDWD.

PMADDWD details

This instruction is specialized and I guess isn't often used. PMADDWD operates on packed words and performs following algorithm:

temp = array of 8 signed dwords

-- 1. multiply 16-bit signed numbers vertically
src = packed_word(...)
dst = packed_word(...)

for i in 0 .. 7 loop

    -- temp is 32-bit signed number

    temp[i] := signed_multiplication(src[i], dst[i])

end loop

-- 2. add adjacent 32-bit words of temp and save result to src
--    now src has type packed_dword
for i in 0 .. 3 loop

    src[i] = temp[2 * i] + temp[2 * i + 1]

end loop

Algorithm

Load 16 ASCII digits in order abcdefghijklmnop, and convert to bytes. Each byte a..p is in range 0..9:

t1 = packed_byte(| p | o | n | m | l | k | j | i | h | g | f | e | d | c | b | a |))

   digit weight  10^0  . . . . . . . . . . . . . . . . . . . . . . . . . . . 10^15

Split even and odd digits:

t2_even = packed_byte(| 0 | o | 0 | m | 0 | k | 0 | i | 0 | g | 0 | e | 0 | c | 0 | a |))
t2_odd  = packed_byte(| 0 | p | 0 | n | 0 | l | 0 | j | 0 | h | 0 | f | 0 | d | 0 | b |))

Calculate partial values of even digits:

t3_even = pmaddwd(t2_even, {1000, 10, 1000, 10, 1000, 10, 1000, 10})
        = packed_dword(| 1000*m + 10*o | 1009*i + 10*k | 1000*e + 10*g | 1000*a + 10*c |)

as 4-digit decimals:

t3_even = packed_dword(| m0o0 | i0k0 | e0g0 | a0c0 |)

Similarly calculate partial values of odd digit:

t4_odd  = pmaddwd(t2_odd,  {100, 1, 100, 1, 100, 1, 100, 1})
        = packed_dword(| 100*n + p | 100*j + l | 100*f + h | 100*b + d |)

as 4-digit decimals:

t4_odd  = packed_dword(| 0n0p | 0j0l | 0f0h | 0b0d |)

Add partial results from step 3 and 4 (shown as decimals):

t5 = t3_even + t4_odd =

   = packed_dword(| m0o0 | i0k0 | e0g0 | a0c0 |)
   + packed_dword(| 0n0p | 0j0l | 0f0h | 0b0d |)

   = packed_dword(| mnop | ijkl | efgh | abcd |)

Last step is to form 8-digits partial results. Since numbers are less than 10000, higher words of t5 are zero:

t5 = packed_dword(| 0000 | mnop | 0000 | ijkl | 0000 | efgh | 0000 | abcd |)

t6 = t5 | (t5 >> 16)
   = packed_dword(| 0000 | mnop | mnop | ijkl | ijkl | efgh | efgh | abcd |)
                                  ^^^^^^^^^^^                 ^^^^^^^^^^^
                                     these dwords are processed later

Calculate results:

t7 = pmaddwd(t6, {10000, 1, 0, 0, 10000, 1, 0, 0})
   = packed_dword(|      0      |   ijklmnop  |      0      |   abcdefgh  |)

64-bit multiplication is available in SSE starting from SSE4.1, so portable way to get final 64-bit value requires a CPU multiplication.

SSSE3

SSSE3 introduced instruction PMADDUBSW that works similarly to PMADDWD but operates on packed bytes. The instruction treats numbers in one vector as signed and, in another vector, as unsigned. This limits a little bit usage, but fortunately not in this case.

Algorithm for SSSE3 can convert 16-digit numbers and works as follows:

Load 16 ASCII digits in order abcdefghijklmnop, and convert to bytes. Each byte a..p is in range 0..9:

t0 = packed_byte (| p | o | n | m | l | k | j | i | h | g | f | e | d | c | b | a |))

    digit weight   10^0  . . . . . . . . . . . . . . . . . . . . . . . . . . . 10^15

Calculate partial values of adjacent digits using PMADDUBSW. After this step each 16-bit element contains a 2-digit number:

t1 = pmaddubsw(t0, {10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1})
   = packed_word (|   po  |   nm  |   lk  |   ji  |   hg  |   fe  |   dc  |   ba  |)

Calculate partial values from adjacent 2-digit numbers using PMADDWD. After this step each 32-bit element contains a 4-digit number. Please note that a 32-bit number can hold an 8-digit decimal number.:
```
t2 = pmaddwd(t1, {100, 1, 100, 1, 100, 1, 100, 1})
   = packed_dword(|      ponm     |      lkji     |      hgfe     |      dcba     |)
```

Convert into vector of words (16-bit words can hold 4-digit numbers):

t3 = packusdw(t2, t2)
   = packed_word (|  ponm |  lkji |  hgfe |  dcba |  ponm |  lkji |  hgfe |  dcba |)

Build 8-digit numbers using PMADDWD:

t4 = pmaddwd(t3, {10000, 1, 10000, 1})
   = packed_dword(|    ponmlkji   |    hgfedcba   |    ponmlkji   |    hgfedcba   |)

Get two lowest dwords from t4 and convert into 16-digit number using scalar instructions.

SSE4.1

In SSE4.1 conversion of 32-digit numbers is very easy. String is split to 16-digit substrings and each substring is processed in steps 1 .. 5 of the algorithm. After 5th step we have two values:

t5_lo = packed_dword(| 0000 | mnop | 0000 | ijkl | 0000 | efgh | 0000 | abcd |)
t5_hi = packed_dword(| 0000 | MNOP | 0000 | IJKL | 0000 | EFGH | 0000 | ABCD |)

Steps 6 and 7 are different:

SSE4.1 supports PACKUSDW, thanks to that t5_lo and t5_hi could be joined with single instruction:

t6 = packusdw(t5_hi, t5_lo)
   = packed_word(| MNOP | IJKL | EFGH | ABCD | mnop | ijkl | efgh | abcd |)

Calculate results:

t7 = pmaddwd(t6, {10000, 1, 10000, 1, 10000, 1, 10000, 1})
   = packed_dword(|  IJKLMNOP  |   ABCDEFGH  |  ijklmnop   |  abcdefgh   |)

SSE4.1 supports also PMULDQ, so producing 64-bit values in SSE registers is possible:

m1 = pmuldq(t7, {100000000, 100000000}
   = packed_qword(|     ABCDEFGH00000000     |       abcdefgh00000000    |)

m2 = packed_dword(|     0      |    t7[3]    |      0      |    t7[1]    |)
   = packed_qword(|     0      |   IJKLMNOP  |      0      |   ijklmnop  |)

t8 = m1 + m2
   = packed_qword(|     ABCDEFGHIJKLMNOP     |       abcdefgijklmnop     |)

However, calculating 128-bit value require CPU instructions.

Sample code

Sample code is available on gitub. Only SSE and SSSE3 implementations are available.

Experiments results

The SWAR methods are described in the article linked at the bottom.

Core i5 M540 @ 2.53GHz

method	time [us]	speedup
naive	247514	1.00
SWAR (1)	158043	1.57
SWAR (2)	148712	1.66
SWAR (3)	125978	1.96
SSE	37704	6.56
SSSE3	27271	9.07

Skylake Core i7-6700 CPU @ 3.40GHz

method	time [us]	speedup
naive	131904	1.00
SWAR (1)	84933	1.55
SWAR (2)	84945	1.55
SWAR (3)	75862	1.74
SSE	18698	7.05
SSSE3	17693	7.45