Parsing decimal numbers — part 2: SSE

Author:Wojciech Muła
Added on:2014-10-15
Updated on:2018-03-22 (added SSSE3 variant)

Contents

SSE procedure is able to convert two 8-digit numbers. The main instruction used in converting to numeric value is PMADDWD.

PMADDWD details

This instruction is specialized and I guess isn't often used. PMADDWD operates on packed words and performs following algorithm:

temp = array of 8 signed dwords

-- 1. multiply 16-bit signed numbers vertically
src = packed_word(...)
dst = packed_word(...)

for i in 0 .. 7 loop

    -- temp is 32-bit signed number

    temp[i] := signed_multiplication(src[i], dst[i])

end loop

-- 2. add adjacent 32-bit words of temp and save result to src
--    now src has type packed_dword
for i in 0 .. 3 loop

    src[i] = temp[2 * i] + temp[2 * i + 1]

end loop

Algorithm

  1. Load 16 ASCII digits in order abcdefghijklmnop, and convert to bytes. Each byte a..p is in range 0..9:

    t1 = packed_byte(| p | o | n | m | l | k | j | i | h | g | f | e | d | c | b | a |))
    
       digit weight  10^0  . . . . . . . . . . . . . . . . . . . . . . . . . . . 10^15
    
  2. Split even and odd digits:

    t2_even = packed_byte(| 0 | o | 0 | m | 0 | k | 0 | i | 0 | g | 0 | e | 0 | c | 0 | a |))
    t2_odd  = packed_byte(| 0 | p | 0 | n | 0 | l | 0 | j | 0 | h | 0 | f | 0 | d | 0 | b |))
    
  3. Calculate partial values of even digits:

    t3_even = pmaddwd(t2_even, {1000, 10, 1000, 10, 1000, 10, 1000, 10})
            = packed_dword(| 1000*m + 10*o | 1009*i + 10*k | 1000*e + 10*g | 1000*a + 10*c |)
    

    as 4-digit decimals:

    t3_even = packed_dword(| m0o0 | i0k0 | e0g0 | a0c0 |)
    
  4. Similarly calculate partial values of odd digit:

    t4_odd  = pmaddwd(t2_odd,  {100, 1, 100, 1, 100, 1, 100, 1})
            = packed_dword(| 100*n + p | 100*j + l | 100*f + h | 100*b + d |)
    

    as 4-digit decimals:

    t4_odd  = packed_dword(| 0n0p | 0j0l | 0f0h | 0b0d |)
    
  5. Add partial results from step 3 and 4 (shown as decimals):

    t5 = t3_even + t4_odd =
    
       = packed_dword(| m0o0 | i0k0 | e0g0 | a0c0 |)
       + packed_dword(| 0n0p | 0j0l | 0f0h | 0b0d |)
    
       = packed_dword(| mnop | ijkl | efgh | abcd |)
    
  6. Last step is to form 8-digits partial results. Since numbers are less than 10000, higher words of t5 are zero:

    t5 = packed_dword(| 0000 | mnop | 0000 | ijkl | 0000 | efgh | 0000 | abcd |)
    
    t6 = t5 | (t5 >> 16)
       = packed_dword(| 0000 | mnop | mnop | ijkl | ijkl | efgh | efgh | abcd |)
                                      ^^^^^^^^^^^                 ^^^^^^^^^^^
                                         these dwords are processed later
    
  7. Calculate results:

    t7 = pmaddwd(t6, {10000, 1, 0, 0, 10000, 1, 0, 0})
       = packed_dword(|      0      |   ijklmnop  |      0      |   abcdefgh  |)
    

64-bit multiplication is available in SSE starting from SSE4.1, so portable way to get final 64-bit value requires a CPU multiplication.

SSSE3 new

SSSE3 introduced instruction PMADDUBSW that works similarly to PMADDWD but operates on packed bytes. The instruction treats numbers in one vector as signed and, in another vector, as unsigned. This limits a little bit usage, but fortunately not in this case.

Algorithm for SSSE3 can convert 16-digit numbers and works as follows:

  1. Load 16 ASCII digits in order abcdefghijklmnop, and convert to bytes. Each byte a..p is in range 0..9:

    t0 = packed_byte (| p | o | n | m | l | k | j | i | h | g | f | e | d | c | b | a |))
    
        digit weight   10^0  . . . . . . . . . . . . . . . . . . . . . . . . . . . 10^15
    
  2. Calculate partial values of adjacent digits using PMADDUBSW. After this step each 16-bit element contains a 2-digit number:

    t1 = pmaddubsw(t0, {10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1})
       = packed_word (|   po  |   nm  |   lk  |   ji  |   hg  |   fe  |   dc  |   ba  |)
    
  3. Calculate partial values from adjacent 2-digit numbers using PMADDWD. After this step each 32-bit element contains a 4-digit number. Please note that a 32-bit number can hold an 8-digit decimal number.:

    t2 = pmaddwd(t1, {100, 1, 100, 1, 100, 1, 100, 1})
       = packed_dword(|      ponm     |      lkji     |      hgfe     |      dcba     |)
    
  4. Convert into vector of words (16-bit words can hold 4-digit numbers):

    t3 = packusdw(t2, t2)
       = packed_word (|  ponm |  lkji |  hgfe |  dcba |  ponm |  lkji |  hgfe |  dcba |)
    
  5. Build 8-digit numbers using PMADDWD:

    t4 = pmaddwd(t3, {10000, 1, 10000, 1})
       = packed_dword(|    ponmlkji   |    hgfedcba   |    ponmlkji   |    hgfedcba   |)
    
  6. Get two lowest dwords from t4 and convert into 16-digit number using scalar instructions.

SSE4.1

In SSE4.1 conversion of 32-digit numbers is very easy. String is split to 16-digit substrings and each substring is processed in steps 1 .. 5 of the algorithm. After 5th step we have two values:

t5_lo = packed_dword(| 0000 | mnop | 0000 | ijkl | 0000 | efgh | 0000 | abcd |)
t5_hi = packed_dword(| 0000 | MNOP | 0000 | IJKL | 0000 | EFGH | 0000 | ABCD |)

Steps 6 and 7 are different:

  1. SSE4.1 supports PACKUSDW, thanks to that t5_lo and t5_hi could be joined with single instruction:

    t6 = packusdw(t5_hi, t5_lo)
       = packed_word(| MNOP | IJKL | EFGH | ABCD | mnop | ijkl | efgh | abcd |)
    
  2. Calculate results:

    t7 = pmaddwd(t6, {10000, 1, 10000, 1, 10000, 1, 10000, 1})
       = packed_dword(|  IJKLMNOP  |   ABCDEFGH  |  ijklmnop   |  abcdefgh   |)
    
  3. SSE4.1 supports also PMULDQ, so producing 64-bit values in SSE registers is possible:

    m1 = pmuldq(t7, {100000000, 100000000}
       = packed_qword(|     ABCDEFGH00000000     |       abcdefgh00000000    |)
    
    m2 = packed_dword(|     0      |    t7[3]    |      0      |    t7[1]    |)
       = packed_qword(|     0      |   IJKLMNOP  |      0      |   ijklmnop  |)
    
    t8 = m1 + m2
       = packed_qword(|     ABCDEFGHIJKLMNOP     |       abcdefgijklmnop     |)
    

However, calculating 128-bit value require CPU instructions.

Sample code

Sample code is available on gitub. Only SSE and SSSE3 implementations are available.

Experiments results new

The SWAR methods are described in the article linked at the bottom.

Core i5 M540 @ 2.53GHz

method time [us] speedup
naive 247514 1.00
SWAR (1) 158043 1.57
SWAR (2) 148712 1.66
SWAR (3) 125978 1.96
SSE 37704 6.56
SSSE3 27271 9.07

Skylake Core i7-6700 CPU @ 3.40GHz

method time [us] speedup
naive 131904 1.00
SWAR (1) 84933 1.55
SWAR (2) 84945 1.55
SWAR (3) 75862 1.74
SSE 18698 7.05
SSSE3 17693 7.45

See also