Author: | Wojciech Muła |
---|---|

Added on: | 2011-10-21 |

Updated on: | 2018-03-25 (updated test results, link to Milo's benchmarks) |

Contents

With SSE2 instructions it's possible to convert up to four numbers in range 0..9999_9999 and get 32 decimal digits results. This texts describe code for two numbers (suitable for 64-bit conversions) and for one number (suitable for 32-bit conversions).

The outline of algorithm 1 has been posted by Piotr Wyderski
on the usenet group `pl.comp.lang.c`, I merely implemented it.
The main idea is to perform in **parallel** divisions & modulo
by 10^{8} (for 64-bit numbers), then 10^{4},
10^{2} and finally 10^{1}.

I've developed the algorithm 2, converting just a single 8-digit number.
First division & modulo by 10^{8} is performed, and an input
vector is formed [*abcd*, *abcd*, *abcd*, *abcd*, *efgh*, *efgh*, *efgh*, *efgh*].
Then this vector is divided by vector [10^{3}, 10^{2}, 10^{1}, 10^{0}, 10^{3}, 10^{2}, 10^{1}, 10^{0}] yielding vector [*a*, *ab*, *abc*, *abcd*, *e*, *ef*, *efg*, *efgh*]. Obtaining an 8 digits result from this vector requires
some shuffling, multiply by 10 and a subtract.

SSE doesn't provide an integer division, but division by a constant can be achieved using multiplication. This is well known and widespread technique, see for example Division by Invariant Integers using Multiplication by Torbjörn Granlund & Peter L. Montgomery, it's also described in Hackers Delight and Reciprocal Multiplication.

The division algorithm follows scheme:

k := 10^n div := (x * MC) >> shift mod := x - div * k

There are two multiplication, one subtract and (sometimes optional)
one right shift. The constant `MC` is calculated with following formula:

MC := (2^d - k + correction)/k shift := d

Where `d >= word bits` and `word bits = {8, 16, 32, 64, ...}`.
The value of `correction` depends on divisor and number of bits, please
check one of the notes source for details.

Instruction `pmulhuw` does in parallel `(word_x[i] * word_y[i]) >> 16`.

If one component has form `1 << n[i]` (`n[i] = 0 .. 15`), then the result
of the instruction is `word_x[i] >> (16 - n[i])`, i.e. a word shifted right
by specified amount from 1 to 16. This is used in algorithm 2.

(Likewise `pmullw` can be used to perform shifts left.)

Both algorithms require this. Removing leading zeros is similar
procedure to `strchr`:

pcmpeqb packed_byte('0'), xmmreg pmovmskb xmmreg, reg not reg -- invert bits or $0x8000, reg -- do not trim '0' if val was 0 bsf reg, reg -- reg contains position of first non-zero digit

On a CPU with SSE4.2 this could be done with single instruction.

Division by 10^{8} is not suitable for SSE code, because
require a 64-bit multiplication. In a 64-bit code such multiplication
is available with CPU instructions, but in a 32-bit code require
4 multiplications and a long add — this would kill performance.

Division by 10^{4} require a 32-bit multiplication. SSE have
`pmuludq` that does 2 multiplications and store full a 64-bit result.

Division by 10^{2} require a 16-bit multiplication. SSE have
`pmulhuw` that does 8 multiplications and store the high 16-bit of
result; likewise `pmullw` stores the low 16-bit of result. This perfectly
fits our needs.

Division by 10^{1} require an 8-bit multiplication, but SSE does not
provide any instruction that vertically multiply bytes. We have to use
a 16-bit multiplication.

Code for conversion 64-bit numbers consist following steps:

Start with two 8-digit numbers (32-bit values):

[ ____ ____ | ____ ____ | 9765 4321 | 1234 5678 ] -- packed dword (_ = 0)

Then divmod 32-bit numbers by 10

^{4}(`pmuludq`), results are 32-bit numbers:[ ____ 9765 | ____ 4321 | ____ 1234 | ____ 5678 ] -- packed dword

Then divmod 32-bit numbers by 10

^{2}(`pmulhuw`and`pmullw`), results are 16-bit numbers:[ __97 | __65 | __43 | __21 | __12 | __34 | __56 | __78 ] -- packed word

Then divmod 32-bit numbers by 10

^{1}(again`pmulhuw`and`pmullw`), results are 16-bit numbers:[ ___7 | ___5 | ___3 | ___1 | ___2 | ___4 | ___6 | ___8 ] -- packed word [ ___9 | ___6 | ___4 | ___2 | ___1 | ___3 | ___5 | ___7 ] -- packed word

Finally join decimal digits, convert to ASCII and remove leading zeros.

This algorithm can convert only a single number from range 0..9999_9999.
The main idea is to divide 4-digits numbers by 10^{3}, 10^{2}
and 10^{1} at once.

Divide & modulo by 10

^{4}(`pmuludq`):v0 = [ 0000 | abcd | 0000 | efgh ] -- packed dword

Populate words:

v1 = [ abcd | abcd | abcd | abcd | efgh | efgh | efgh | efgh ] -- packed word

Multiply

`v1`by 4 with`psllw`(safe, because 9999 ⋅ 4 < 2^{16}− 1):v2 = [ 4*abcd | 4*abcd | 4*abcd | 4*abcd | 4*efgh | 4*efgh | 4*efgh | 4*efgh ]

Divide by 10

^{3}, 10^{2}, 10^{1}, and 10^{0}(single`pmulhuw`):v3 = [ a * 2^(7 + 2) | ab * 2^(3 + 2) | abc * 2^(1 + 2) | 2*abcd | <likwise digits efgh> ]

Shift right by 9, 5, 3 and 1 bits each 4-words group (single

`pmulhuw`, see subsection below):v4 = [ a | ab | abc | abcd | e | ef | efg | efgh ]

Calculate modulo:

v5 = packed_qword(v4) >> 16 v6 = packed_word(v5) * 10 = = [ 0 | a0 | ab0 | abc0 | 0 | e0 | ef0 | efg0 ] v7 = v4 - v6 = = [ a | b | c | d | e | f | g | h ]

Pack words to bytes (

`packuswb`), convert to ASCII and remove leading zeros.

Ad 3. With `pmuluhw` we cannot divide by 1, the minimum is 2. Because
`pmulhuw` is invoked two times (for divide and shift), we have to multiply
input by 4, to finally get the `abcd` & `efgh` after step 5.

- Signed numbers conversion requires to put a sign char in the front of string. This have to be done with CPU (scalar) instructions.
- Vectorized code generate 8 or 16 digits at single run. The longest 32-bit number has 10 decimal digits, and 64-bit — 20 digits, thus 2 or 4 digits are not converted in SSE code. Using SSE code to obtain them wouldn't be wise decision.

Sample program provides following functions:

`utoa64_sse`— algortihm 1: described here;`utoa32_sse`— algorithm 1: implementation for 32-bit numbers;`utoa32_sse_2`— algorithm 2`utoa32`&`uta64`— plain C implementations; faster than`sprintf("%d", &x)`.

Please read the source header to find how to compile. Program can be used to measure speed of the selected procedure and to print results, thus allow to verify correctness.

The **best times** are considered. For larger number speedup is around 3
times, while for short numbers time is comparable to the plain C code, and
for very short numbers is much worse.

Tests were run on my Linux box with Core2 Duo E8200 (2GHz); gcc 4.6.1
(`-O3` flag) was used to compile the program.

number range | iterations | C [s] | SSE [s] | speedup |
---|---|---|---|---|

0 .. 99 | 1_000_000 | 0.492 | 0.880 | 56% |

0 .. 9999 | 10_000 | 1.076 | 0.884 | 122% |

0 .. 9999_9999 | 1 | 2.480 | 0.872 | 284% |

1000_0000 .. 9999_9999 | 1 | 2.276 | 0.784 | 290% |

number range | iterations | C [s] | SSE [s] | speedup |
---|---|---|---|---|

0 .. 99 | 1_000_000 | 0.492 | 0.808 | 61% |

0 .. 9999 | 10_000 | 1.076 | 0.812 | 132% |

0 .. 9999_9999 | 1 | 2.480 | 0.728 | 340% |

1000_0000 .. 9999_9999 | 1 | 2.276 | 0.812 | 280% |

Compilation with GCC 7.3.0.

number range | iterations | C [ms] | SSE [ms] | speedup |
---|---|---|---|---|

0 .. 99 | 100_000 | 62 | 98 | 0.62 |

0 .. 9999 | 1_000 | 106 | 98 | 1.08 |

0 .. 9999_9999 | 1 | 2393 | 881 | 2.71 |

1000_0000 .. 9999_9999 | 1 | 2197 | 881 | 2.49 |

number range | iterations | C [ms] | SSE [ms] | speedup |
---|---|---|---|---|

0 .. 99 | 100_000 | 62 | 78 | 0.79 |

0 .. 9999 | 1_000 | 106 | 76 | 1.34 |

0 .. 9999_9999 | 1 | 2393 | 783 | 3.05 |

1000_0000 .. 9999_9999 | 1 | 2197 | 704 | 3.12 |

Compilation with GCC 7.3.0.

number range | iterations | C [ms] | SSE [ms] | speedup |
---|---|---|---|---|

0 .. 99 | 100_000 | 31 | 154 | 0.20 |

0 .. 9999 | 1_000 | 51 | 155 | 0.33 |

0 .. 9999_9999 | 1 | 1077 | 1389 | 0.78 |

1000_0000 .. 9999_9999 | 1 | 986 | 1533 | 0.64 |

number range | iterations | C [ms] | SSE [ms] | speedup |
---|---|---|---|---|

0 .. 99 | 100_000 | 31 | 123 | 0.25 |

0 .. 9999 | 1_000 | 51 | 122 | 0.42 |

0 .. 9999_9999 | 1 | 1077 | 1221 | 0.88 |

1000_0000 .. 9999_9999 | 1 | 986 | 1099 | 0.90 |

Milo Yip compared different itoa and dtoa implementations on Core i7, including the algorithm 2 which use SSE2 instructions.

Results for itoa are interesting: SSE2 version is not as good as it seemed to be. Tricky branchlut algorithm is only 10% slower, moreover is perfectly portable. One obvious drawback of this method is using lookup-table - in real environment where is a big pressure on cache, memory access could be a bottleneck.

- 2011-11-06
- 2013-11-24 - minor updates
- 2011-11-06 -
`utoa32_sse_2` - 2011-10-15 -
`utoa64_sse` - 2011-10-09 - initial results
- 2011-10-08 - optimization
- 2011-10-07 - first
`utoa32_sse`version