Author: | Wojciech Muła |
---|---|

Added on: | 15.06.2008 |

Floating point numbers store at most 23 (32-bit `float`) or 52 (64-bit
`double`) bits of mantissa; the most significant bit is always set (if
number is normalized), thus 24 or 53 bits are known, but MSB bit isn't
stored.

**Note**: methods presented here produce correct results only if
certain conditions are met — they are not general, don't work
for NaNs, denormalized numbers, etc.

Few years ago I've developed a method that do not need any floating-point operations — description is written in Polish, but sample code should be easy to understand. In short words mantissa is completed with the implicit bit 23 (or 52) and treated as a natural number. Then this number is shifted left or right to place the dot position at 0 — the shift amount depends on the exponent value.

Another method uses floating point operations and is limited to
positive number less than 2^{23} (float) (and 2^{52}
for doubles).

When value 2^{23} is added to another float, then just 23
most significant bits are stored — the fraction bits are shifted out.

Let see an example, number 7.25 (111.01) has following floating point representation:

+-+--------+-----------------------+ |0|10000001|11010000000000000000000| +-+--------+-----------------------+ S exp+127 normalized mantissa

After adding 2^{23}:

+-+--------+-----------------------+ |0|10010110|00000000000000000000111| +-+--------+-----------------------+ S exp+127 normalized mantissa

Mantissa field **treated as natural number** contains an integer part of number.

Because addition is used, then the result is rounded or truncated, depending on the current FPU's rounding settings. When bare bit shift is used instead of addition (as in the method mentioned earlier), then the number is always truncated.

**Note**: this method could be used to get fixed point, just smaller value
is needed: 2^{23 − fractionbits}, but this also limit the maximum value
of float.

Implementation from sample program float2int.c:

void convert_simple() { double C = (1ll << 52); union { double val; int64_t bin; } tmp; int i; for (i=0; i < SIZE; i++) { tmp.val = in[i] + C; tmp.bin = tmp.bin & 0x000fffffffffffffll; out_2[i] = tmp.bin; } }

However this method is slower than ordinal FPU instructions, i.e.:

fldl (%eax) fistpl (%ebx) (or fisttpl (%ebx) on CPU with SSSE3)

Very similar method could be used to perform float rounding. First value 2^{23}
is added, resulting in fraction bits lost, then same value is subtracted.

There is FPU instruction `FRNDINT` that does the job, but is very slow:

fldl (%eax) frndint fstpl (%ebx)

Faster is another FPU code:

fldl (%eax) ; load fp fistpl tmp ; convert to 64-bit int fildl tmp ; convert int back to float fstpl (%ebx) ; save rounded fp

But the trick beats both FPU methods. Sample program round2.c has been compiled with following options:

gcc -march=core2 -O3 -Wall -pedantic -std=c99 round.c -o test

Then program was run on C2D E8200 @ 2.6GHz:

$ time ./test 0 10000 stdlib, iterations = 10000, size = 65536 real 0m13.344s user 0m13.333s sys 0m0.000s $ time ./test 1 10000 FPU FRNDINT, iterations = 10000, size = 65536 real 0m15.748s user 0m15.733s sys 0m0.000s $ time ./test 2 10000 FPU FISTP/FILD, iterations = 10000, size = 65536 real 0m1.280s user 0m1.276s sys 0m0.004s $ time ./test 3 10000 simple method (C impl.), iterations = 10000, size = 65536 real 0m0.456s user 0m0.452s sys 0m0.000s

On Intel forum user **pvercello** asked about SSE-assisted conversion of
signed int 64-bit (less than 2^{52}) to double. I have proposed a simple
algorithm to convert unsigned ints, that use trick similar to these presented
above. Then **pvercello** shown better approach, that need fewer operations.
Finally I found a simple, branch-less way to do a singed conversion,
unfortunately slower than FPU conversion.

Below is unsigned int to float algorithm outline. Please follow the discussion to find details about signed conversion. The idea is based on the same property of floats that is use in float -> int conversion.

Assure, that the number is less than 2

^{52}(or 2^{23}for floats):if (x < (1ll << 52)) return ERANGE;

Set the exponent of floating point number to 52 — value of such double is 2

^{52}+*x*:uint64_t x_double = (1ll << 52) | x; // bitwise or

Last step is to subtract floating point value 2

^{52}:double X = (*(double*)&x_double) - (double)(1ll << 52);

Sample program convert_int52_double.c is available.