Author: | Wojciech Muła |
---|---|
Added on: | 2011-10-21 |
Updated on: | 2018-03-25 (updated test results, link to Milo's benchmarks) |
Contents
With SSE2 instructions it's possible to convert up to four numbers in range 0..9999_9999 and get 32 decimal digits results. This texts describe code for two numbers (suitable for 64-bit conversions) and for one number (suitable for 32-bit conversions).
The outline of algorithm 1 has been posted by Piotr Wyderski on the usenet group pl.comp.lang.c, I merely implemented it. The main idea is to perform in parallel divisions & modulo by 108 (for 64-bit numbers), then 104, 102 and finally 101.
I've developed the algorithm 2, converting just a single 8-digit number. First division & modulo by 108 is performed, and an input vector is formed [abcd, abcd, abcd, abcd, efgh, efgh, efgh, efgh]. Then this vector is divided by vector [103, 102, 101, 100, 103, 102, 101, 100] yielding vector [a, ab, abc, abcd, e, ef, efg, efgh]. Obtaining an 8 digits result from this vector requires some shuffling, multiply by 10 and a subtract.
SSE doesn't provide an integer division, but division by a constant can be achieved using multiplication. This is well known and widespread technique, see for example Division by Invariant Integers using Multiplication by Torbjörn Granlund & Peter L. Montgomery, it's also described in Hackers Delight and Reciprocal Multiplication.
The division algorithm follows scheme:
k := 10^n div := (x * MC) >> shift mod := x - div * k
There are two multiplication, one subtract and (sometimes optional) one right shift. The constant MC is calculated with following formula:
MC := (2^d - k + correction)/k shift := d
Where d >= word bits and word bits = {8, 16, 32, 64, ...}. The value of correction depends on divisor and number of bits, please check one of the notes source for details.
Instruction pmulhuw does in parallel (word_x[i] * word_y[i]) >> 16.
If one component has form 1 << n[i] (n[i] = 0 .. 15), then the result of the instruction is word_x[i] >> (16 - n[i]), i.e. a word shifted right by specified amount from 1 to 16. This is used in algorithm 2.
(Likewise pmullw can be used to perform shifts left.)
Both algorithms require this. Removing leading zeros is similar procedure to strchr:
pcmpeqb packed_byte('0'), xmmreg pmovmskb xmmreg, reg not reg -- invert bits or $0x8000, reg -- do not trim '0' if val was 0 bsf reg, reg -- reg contains position of first non-zero digit
On a CPU with SSE4.2 this could be done with single instruction.
Division by 108 is not suitable for SSE code, because require a 64-bit multiplication. In a 64-bit code such multiplication is available with CPU instructions, but in a 32-bit code require 4 multiplications and a long add — this would kill performance.
Division by 104 require a 32-bit multiplication. SSE have pmuludq that does 2 multiplications and store full a 64-bit result.
Division by 102 require a 16-bit multiplication. SSE have pmulhuw that does 8 multiplications and store the high 16-bit of result; likewise pmullw stores the low 16-bit of result. This perfectly fits our needs.
Division by 101 require an 8-bit multiplication, but SSE does not provide any instruction that vertically multiply bytes. We have to use a 16-bit multiplication.
Code for conversion 64-bit numbers consist following steps:
Start with two 8-digit numbers (32-bit values):
[ ____ ____ | ____ ____ | 9765 4321 | 1234 5678 ] -- packed dword (_ = 0)
Then divmod 32-bit numbers by 104 (pmuludq), results are 32-bit numbers:
[ ____ 9765 | ____ 4321 | ____ 1234 | ____ 5678 ] -- packed dword
Then divmod 32-bit numbers by 102 (pmulhuw and pmullw), results are 16-bit numbers:
[ __97 | __65 | __43 | __21 | __12 | __34 | __56 | __78 ] -- packed word
Then divmod 32-bit numbers by 101 (again pmulhuw and pmullw), results are 16-bit numbers:
[ ___7 | ___5 | ___3 | ___1 | ___2 | ___4 | ___6 | ___8 ] -- packed word [ ___9 | ___6 | ___4 | ___2 | ___1 | ___3 | ___5 | ___7 ] -- packed word
Finally join decimal digits, convert to ASCII and remove leading zeros.
This algorithm can convert only a single number from range 0..9999_9999. The main idea is to divide 4-digits numbers by 103, 102 and 101 at once.
Divide & modulo by 104 (pmuludq):
v0 = [ 0000 | abcd | 0000 | efgh ] -- packed dword
Populate words:
v1 = [ abcd | abcd | abcd | abcd | efgh | efgh | efgh | efgh ] -- packed word
Multiply v1 by 4 with psllw (safe, because 9999 ⋅ 4 < 216 − 1):
v2 = [ 4*abcd | 4*abcd | 4*abcd | 4*abcd | 4*efgh | 4*efgh | 4*efgh | 4*efgh ]
Divide by 103, 102, 101, and 100 (single pmulhuw):
v3 = [ a * 2^(7 + 2) | ab * 2^(3 + 2) | abc * 2^(1 + 2) | 2*abcd | <likewise digits efgh> ]
Shift right by 9, 5, 3 and 1 bits each 4-words group (single pmulhuw, see subsection below):
v4 = [ a | ab | abc | abcd | e | ef | efg | efgh ]
Calculate modulo:
v5 = packed_qword(v4) >> 16 v6 = packed_word(v5) * 10 = = [ 0 | a0 | ab0 | abc0 | 0 | e0 | ef0 | efg0 ] v7 = v4 - v6 = = [ a | b | c | d | e | f | g | h ]
Pack words to bytes (packuswb), convert to ASCII and remove leading zeros.
Ad 3. With pmuluhw we cannot divide by 1, the minimum is 2. Because pmulhuw is invoked two times (for divide and shift), we have to multiply input by 4, to finally get the abcd & efgh after step 5.
Sample program provides following functions:
Please read the source header to find how to compile. Program can be used to measure speed of the selected procedure and to print results, thus allow to verify correctness.
The best times are considered. For larger number speedup is around 3 times, while for short numbers time is comparable to the plain C code, and for very short numbers is much worse.
Tests were run on my Linux box with Core2 Duo E8200 (2GHz); gcc 4.6.1 (-O3 flag) was used to compile the program.
number range | iterations | C [s] | SSE [s] | speedup |
---|---|---|---|---|
0 .. 99 | 1_000_000 | 0.492 | 0.880 | 56% |
0 .. 9999 | 10_000 | 1.076 | 0.884 | 122% |
0 .. 9999_9999 | 1 | 2.480 | 0.872 | 284% |
1000_0000 .. 9999_9999 | 1 | 2.276 | 0.784 | 290% |
number range | iterations | C [s] | SSE [s] | speedup |
---|---|---|---|---|
0 .. 99 | 1_000_000 | 0.492 | 0.808 | 61% |
0 .. 9999 | 10_000 | 1.076 | 0.812 | 132% |
0 .. 9999_9999 | 1 | 2.480 | 0.728 | 340% |
1000_0000 .. 9999_9999 | 1 | 2.276 | 0.812 | 280% |
Compilation with GCC 7.3.0.
number range | iterations | C [ms] | SSE [ms] | speedup |
---|---|---|---|---|
0 .. 99 | 100_000 | 62 | 98 | 0.62 |
0 .. 9999 | 1_000 | 106 | 98 | 1.08 |
0 .. 9999_9999 | 1 | 2393 | 881 | 2.71 |
1000_0000 .. 9999_9999 | 1 | 2197 | 881 | 2.49 |
number range | iterations | C [ms] | SSE [ms] | speedup |
---|---|---|---|---|
0 .. 99 | 100_000 | 62 | 78 | 0.79 |
0 .. 9999 | 1_000 | 106 | 76 | 1.34 |
0 .. 9999_9999 | 1 | 2393 | 783 | 3.05 |
1000_0000 .. 9999_9999 | 1 | 2197 | 704 | 3.12 |
Compilation with GCC 7.3.0.
number range | iterations | C [ms] | SSE [ms] | speedup |
---|---|---|---|---|
0 .. 99 | 100_000 | 31 | 154 | 0.20 |
0 .. 9999 | 1_000 | 51 | 155 | 0.33 |
0 .. 9999_9999 | 1 | 1077 | 1389 | 0.78 |
1000_0000 .. 9999_9999 | 1 | 986 | 1533 | 0.64 |
number range | iterations | C [ms] | SSE [ms] | speedup |
---|---|---|---|---|
0 .. 99 | 100_000 | 31 | 123 | 0.25 |
0 .. 9999 | 1_000 | 51 | 122 | 0.42 |
0 .. 9999_9999 | 1 | 1077 | 1221 | 0.88 |
1000_0000 .. 9999_9999 | 1 | 986 | 1099 | 0.90 |
Milo Yip compared different itoa and dtoa implementations on Core i7, including the algorithm 2 which use SSE2 instructions.
Results for itoa are interesting: SSE2 version is not as good as it seemed to be. Tricky branchlut algorithm is only 10% slower, moreover is perfectly portable. One obvious drawback of this method is using lookup-table - in real environment where is a big pressure on cache, memory access could be a bottleneck.