SSE: conversion integers to decimal representation

Author:	Wojciech Muła
Added on:	2011-10-21
Updated on:	2018-03-25 (updated test results, link to Milo's benchmarks)

Contents

Introduction

With SSE2 instructions it's possible to convert up to four numbers in range 0..9999_9999 and get 32 decimal digits results. This texts describe code for two numbers (suitable for 64-bit conversions) and for one number (suitable for 32-bit conversions).

The outline of algorithm 1 has been posted by Piotr Wyderski on the usenet group pl.comp.lang.c, I merely implemented it. The main idea is to perform in parallel divisions & modulo by 10⁸ (for 64-bit numbers), then 10⁴, 10² and finally 10¹.

I've developed the algorithm 2, converting just a single 8-digit number. First division & modulo by 10⁸ is performed, and an input vector is formed [abcd, abcd, abcd, abcd, efgh, efgh, efgh, efgh]. Then this vector is divided by vector [10³, 10², 10¹, 10⁰, 10³, 10², 10¹, 10⁰] yielding vector [a, ab, abc, abcd, e, ef, efg, efgh]. Obtaining an 8 digits result from this vector requires some shuffling, multiply by 10 and a subtract.

SSE details

Division

SSE doesn't provide an integer division, but division by a constant can be achieved using multiplication. This is well known and widespread technique, see for example Division by Invariant Integers using Multiplication by Torbjörn Granlund & Peter L. Montgomery, it's also described in Hackers Delight and Reciprocal Multiplication.

The division algorithm follows scheme:

k   := 10^n
div := (x * MC) >> shift
mod := x - div * k

There are two multiplication, one subtract and (sometimes optional) one right shift. The constant MC is calculated with following formula:

MC := (2^d - k + correction)/k
shift := d

Where d >= word bits and word bits = {8, 16, 32, 64, ...}. The value of correction depends on divisor and number of bits, please check one of the notes source for details.

Variable shifts

Instruction pmulhuw does in parallel (word_x[i] * word_y[i]) >> 16.

If one component has form 1 << n[i] (n[i] = 0 .. 15), then the result of the instruction is word_x[i] >> (16 - n[i]), i.e. a word shifted right by specified amount from 1 to 16. This is used in algorithm 2.

(Likewise pmullw can be used to perform shifts left.)

Counting leading zeros

Both algorithms require this. Removing leading zeros is similar procedure to strchr:

pcmpeqb     packed_byte('0'), xmmreg
pmovmskb    xmmreg, reg
not         reg               -- invert bits
or          $0x8000, reg      -- do not trim '0' if val was 0
bsf         reg, reg          -- reg contains position of first non-zero digit

On a CPU with SSE4.2 this could be done with single instruction.

Algorithm 1

Division by 10⁸ is not suitable for SSE code, because require a 64-bit multiplication. In a 64-bit code such multiplication is available with CPU instructions, but in a 32-bit code require 4 multiplications and a long add — this would kill performance.

Division by 10⁴ require a 32-bit multiplication. SSE have pmuludq that does 2 multiplications and store full a 64-bit result.

Division by 10² require a 16-bit multiplication. SSE have pmulhuw that does 8 multiplications and store the high 16-bit of result; likewise pmullw stores the low 16-bit of result. This perfectly fits our needs.

Division by 10¹ require an 8-bit multiplication, but SSE does not provide any instruction that vertically multiply bytes. We have to use a 16-bit multiplication.

Code for conversion 64-bit numbers consist following steps:

Start with two 8-digit numbers (32-bit values):

[ ____ ____ | ____ ____ | 9765 4321 | 1234 5678 ]       -- packed dword (_ = 0)

Then divmod 32-bit numbers by 10⁴ (pmuludq), results are 32-bit numbers:

[ ____ 9765 | ____ 4321 | ____ 1234 | ____ 5678 ]       -- packed dword

Then divmod 32-bit numbers by 10² (pmulhuw and pmullw), results are 16-bit numbers:

[ __97 | __65 | __43 | __21 | __12 | __34 | __56 | __78 ]       -- packed word

Then divmod 32-bit numbers by 10¹ (again pmulhuw and pmullw), results are 16-bit numbers:

[ ___7 | ___5 | ___3 | ___1 | ___2 | ___4 | ___6 | ___8 ]       -- packed word
[ ___9 | ___6 | ___4 | ___2 | ___1 | ___3 | ___5 | ___7 ]       -- packed word

Finally join decimal digits, convert to ASCII and remove leading zeros.

Algorithm 2

This algorithm can convert only a single number from range 0..9999_9999. The main idea is to divide 4-digits numbers by 10³, 10² and 10¹ at once.

Divide & modulo by 10⁴ (pmuludq):

v0 = [ 0000 | abcd | 0000 | efgh ] -- packed dword

Populate words:

v1 = [ abcd | abcd | abcd | abcd | efgh | efgh | efgh | efgh ] -- packed word

Multiply v1 by 4 with psllw (safe, because 9999 ⋅ 4 < 2¹⁶ − 1):

v2 = [ 4*abcd | 4*abcd | 4*abcd | 4*abcd | 4*efgh | 4*efgh | 4*efgh | 4*efgh ]

Divide by 10³, 10², 10¹, and 10⁰ (single pmulhuw):

v3 = [ a * 2^(7 + 2) | ab * 2^(3 + 2) | abc * 2^(1 + 2) | 2*abcd | <likewise digits efgh> ]

Shift right by 9, 5, 3 and 1 bits each 4-words group (single pmulhuw, see subsection below):
```
v4 = [    a |   ab |  abc | abcd |    e |   ef |  efg | efgh ]
```

Calculate modulo:

v5 = packed_qword(v4) >> 16

v6 = packed_word(v5) * 10 =
   = [    0 |   a0 |  ab0 | abc0 |    0 |   e0 |  ef0 | efg0 ]

v7 = v4 - v6 =
   = [    a |    b |    c |    d |    e |    f |    g |    h ]

Pack words to bytes (packuswb), convert to ASCII and remove leading zeros.

Ad 3. With pmuluhw we cannot divide by 1, the minimum is 2. Because pmulhuw is invoked two times (for divide and shift), we have to multiply input by 4, to finally get the abcd & efgh after step 5.

Real-word requirements & limitations

Signed numbers conversion requires to put a sign char in the front of string. This have to be done with CPU (scalar) instructions.
Vectorized code generate 8 or 16 digits at single run. The longest 32-bit number has 10 decimal digits, and 64-bit — 20 digits, thus 2 or 4 digits are not converted in SSE code. Using SSE code to obtain them wouldn't be wise decision.

Implementation

Sample program provides following functions:

utoa64_sse — algortihm 1: described here;
utoa32_sse — algorithm 1: implementation for 32-bit numbers;
utoa32_sse_2 — algorithm 2
utoa32 & uta64 — plain C implementations; faster than sprintf("%d", &x).

Please read the source header to find how to compile. Program can be used to measure speed of the selected procedure and to print results, thus allow to verify correctness.

Tests results

The best times are considered. For larger number speedup is around 3 times, while for short numbers time is comparable to the plain C code, and for very short numbers is much worse.

Core2 Duo E8200 (2Gz)

Tests were run on my Linux box with Core2 Duo E8200 (2GHz); gcc 4.6.1 (-O3 flag) was used to compile the program.

Algorithm 1 — utoa32_sse

number range	iterations	C [s]	SSE [s]	speedup
0 .. 99	1_000_000	0.492	0.880	56%
0 .. 9999	10_000	1.076	0.884	122%
0 .. 9999_9999	1	2.480	0.872	284%
1000_0000 .. 9999_9999	1	2.276	0.784	290%

Algorithm 2 — utoa32_sse_2

number range	iterations	C [s]	SSE [s]	speedup
0 .. 99	1_000_000	0.492	0.808	61%
0 .. 9999	10_000	1.076	0.812	132%
0 .. 9999_9999	1	2.480	0.728	340%
1000_0000 .. 9999_9999	1	2.276	0.812	280%

Westmere i5 M540 @ 2.53GHz

Compilation with GCC 7.3.0.

Algorithm 1 — utoa32_sse

number range	iterations	C [ms]	SSE [ms]	speedup
0 .. 99	100_000	62	98	0.62
0 .. 9999	1_000	106	98	1.08
0 .. 9999_9999	1	2393	881	2.71
1000_0000 .. 9999_9999	1	2197	881	2.49

Algorithm 2 — utoa32_sse_2

number range	iterations	C [ms]	SSE [ms]	speedup
0 .. 99	100_000	62	78	0.79
0 .. 9999	1_000	106	76	1.34
0 .. 9999_9999	1	2393	783	3.05
1000_0000 .. 9999_9999	1	2197	704	3.12

Core i7-6700 CPU @ 3.40GHz

Compilation with GCC 7.3.0.

Algorithm 1 — utoa32_sse

number range	iterations	C [ms]	SSE [ms]	speedup
0 .. 99	100_000	31	154	0.20
0 .. 9999	1_000	51	155	0.33
0 .. 9999_9999	1	1077	1389	0.78
1000_0000 .. 9999_9999	1	986	1533	0.64

Algorithm 2 — utoa32_sse_2

number range	iterations	C [ms]	SSE [ms]	speedup
0 .. 99	100_000	31	123	0.25
0 .. 9999	1_000	51	122	0.42
0 .. 9999_9999	1	1077	1221	0.88
1000_0000 .. 9999_9999	1	986	1099	0.90

Changelog

2011-11-06
2013-11-24 - minor updates
2011-11-06 - utoa32_sse_2
2011-10-15 - utoa64_sse
2011-10-09 - initial results
2011-10-08 - optimization
2011-10-07 - first utoa32_sse version

SSE: conversion integers to decimal representation

Introduction

SSE details

Division

Variable shifts

Counting leading zeros

Algorithm 1

Algorithm 2

Real-word requirements & limitations

Implementation

Tests results

Core2 Duo E8200 (2Gz)

Algorithm 1 — utoa32_sse

Algorithm 2 — utoa32_sse_2

Westmere i5 M540 @ 2.53GHz

Algorithm 1 — utoa32_sse

Algorithm 2 — utoa32_sse_2

Core i7-6700 CPU @ 3.40GHz

Algorithm 1 — utoa32_sse

Algorithm 2 — utoa32_sse_2

See also

Changelog