Parsing decimal numbers — part 2: SSE

Author:Wojciech Muła
Added on:2014-10-15

Contents

SSE procedure is able to convert two 8-digit numbers. The main instruction used in converting to numeric value is PMADDWD.

PMADDWD details

This instruction is specialized and I guess isn't often used. PMADDWD operates on packed words and performs following algorithm:

temp = array of 8 signed dwords

-- 1. multiply 16-bit signed numbers vertically
src = packed_word(...)
dst = packed_word(...)

for i in 0 .. 7 loop

    -- temp is 32-bit signed number

    temp[i] := signed_multiplication(src[i], dst[i])

end loop

-- 2. add adjacent 32-bit words of temp and save result to src
--    now src has type packed_dword
for i in 0 .. 3 loop

    src[i] = temp[2 * i] + temp[2 * i + 1]

end loop

Algorithm

  1. Load 16 ASCII digits in order abcdefghijklmnop, and convert to bytes. Each byte a..p is in range 0..9:

    t1 = packed_byte(| p | o | n | m | l | k | j | i | h | g | f | e | d | c | b | a |))
    
       digit weight  10^0  . . . . . . . . . . . . . . . . . . . . . . . . . . . 10^16
    
  2. Split even and odd digits:

    t2_even = packed_byte(| 0 | o | 0 | m | 0 | k | 0 | i | 0 | g | 0 | e | 0 | c | 0 | a |))
    t2_odd  = packed_byte(| 0 | p | 0 | n | 0 | l | 0 | j | 0 | h | 0 | f | 0 | d | 0 | b |))
    
  3. Calculate partial values of even digits:

    t3_even = pmaddwd(t2_even, {1000, 10, 1000, 10, 1000, 10, 1000, 10})
            = packed_dword(| 1000*m + 10*o | 1009*i + 10*k | 1000*e + 10*g | 1000*a + 10*c |)
    

    as 4-digit decimals:

    t3_even = packed_dword(| m0o0 | i0k0 | e0g0 | a0c0 |)
    
  4. Similarly calculate partial values of odd digit:

    t4_odd  = pmaddwd(t2_odd,  {100, 1, 100, 1, 100, 1, 100, 1})
            = packed_dword(| 100*n + p | 100*j + l | 100*f + h | 100*b + d |)
    

    as 4-digit decimals:

    t4_odd  = packed_dword(| 0n0p | 0j0l | 0f0h | 0b0d |)
    
  5. Add partial results from step 3 and 4 (shown as decimals):

    t5 = t3_even + t4_odd =
    
       = packed_dword(| m0o0 | i0k0 | e0g0 | a0c0 |)
       + packed_dword(| 0n0p | 0j0l | 0f0h | 0b0d |)
    
       = packed_dword(| mnop | ijkl | efgh | abcd |)
    
  6. Last step is to form 8-digits partial results. Since numbers are less than 10000, higher words of t5 are zero:

    t5 = packed_dword(| 0000 | mnop | 0000 | ijkl | 0000 | efgh | 0000 | abcd |)
    
    t6 = t5 | (t5 >> 16)
       = packed_dword(| 0000 | mnop | mnop | ijkl | ijkl | efgh | efgh | abcd |)
                                      ^^^^^^^^^^^                 ^^^^^^^^^^^
                                         these dwords are processed later
    
  7. Calculate results:

    t7 = pmaddwd(t6, {10000, 1, 0, 0, 10000, 1, 0, 0})
       = packed_dword(|      0      |   ijklmnop  |      0      |   abcdefgh  |)
    

64-bit multiplication is available in SSE starting from SSE4.1, so portable way to get final 64-bit value requires a CPU multiplication.

SSE4.1

In SSE4.1 conversion of 32-digit numbers is very easy. String is split to 16-digit substrings and each substring is processed in steps 1 .. 5 of the algorithm. After 5th step we have two values:

t5_lo = packed_dword(| 0000 | mnop | 0000 | ijkl | 0000 | efgh | 0000 | abcd |)
t5_hi = packed_dword(| 0000 | MNOP | 0000 | IJKL | 0000 | EFGH | 0000 | ABCD |)

Steps 6 and 7 are different:

  1. SSE4.1 supports PACKUSDW, thanks to that t5_lo and t5_hi could be joined with single instruction:

    t6 = packusdw(t5_hi, t5_lo)
       = packed_word(| MNOP | IJKL | EFGH | ABCD | mnop | ijkl | efgh | abcd |)
    
  2. Calculate results:

    t7 = pmaddwd(t6, {10000, 1, 10000, 1, 10000, 1, 10000, 1})
       = packed_dword(|  IJKLMNOP  |   ABCDEFGH  |  ijklmnop   |  abcdefgh   |)
    
  3. SSE4.1 supports also PMULDQ, so producing 64-bit values in SSE registers is possible:

    m1 = pmuldq(t7, {100000000, 100000000}
       = packed_qword(|     ABCDEFGH00000000     |       abcdefgh00000000    |)
    
    m2 = packed_dword(|     0      |    t7[3]    |      0      |    t7[1]    |)
       = packed_qword(|     0      |   IJKLMNOP  |      0      |   ijklmnop  |)
    
    t8 = m1 + m2
       = packed_qword(|     ABCDEFGHIJKLMNOP     |       abcdefgijklmnop     |)
    

However calculating 128-bit value require CPU instructions.

Sample code

Sample code is available on gitub. Only plain SSE implementation is available (see file parse.sse.c).

See also