Floating point tricks

Author:Wojciech Muła
Added on:15.06.2008

Floating point numbers store at most 23 (32-bit float) or 52 (64-bit double) bits of mantissa; the most significant bit is always set (if number is normalized), thus 24 or 53 bits are known, but MSB bit isn't stored.

Note: methods presented here produce correct results only if certain conditions are met — they are not general, don't work for NaNs, denormalized numbers, etc.

Converting float to int

Few years ago I've developed a method that do not need any floating-point operations — description is written in Polish, but sample code should be easy to understand. In short words mantissa is completed with the implicit bit 23 (or 52) and treated as a natural number. Then this number is shifted left or right to place the dot position at 0 — the shift amount depends on the exponent value.

Another method uses floating point operations and is limited to positive number less than 223 (float) (and 252 for doubles).

When value 223 is added to another float, then just 23 most significant bits are stored — the fraction bits are shifted out.

Let see an example, number 7.25 (111.01) has following floating point representation:

 S exp+127    normalized mantissa

After adding 223:

 S exp+127    normalized mantissa

Mantissa field treated as natural number contains an integer part of number.

Because addition is used, then the result is rounded or truncated, depending on the current FPU's rounding settings. When bare bit shift is used instead of addition (as in the method mentioned earlier), then the number is always truncated.

Note: this method could be used to get fixed point, just smaller value is needed: 223 − fractionbits, but this also limit the maximum value of float.

Implementation from sample program float2int.c:

void convert_simple() {
        double C = (1ll << 52);
        union {
                double  val;
                int64_t bin;
        } tmp;

        int i;
        for (i=0; i < SIZE; i++) {
                tmp.val  = in[i] + C;
                tmp.bin  = tmp.bin & 0x000fffffffffffffll;
                out_2[i] = tmp.bin;

However this method is slower than ordinal FPU instructions, i.e.:

fldl    (%eax)
fistpl  (%ebx)  (or fisttpl (%ebx) on CPU with SSSE3)

Float rounding

Very similar method could be used to perform float rounding. First value 223 is added, resulting in fraction bits lost, then same value is subtracted.

There is FPU instruction FRNDINT that does the job, but is very slow:

fldl    (%eax)
fstpl   (%ebx)

Faster is another FPU code:

fldl    (%eax)  ; load fp
fistpl  tmp     ; convert to 64-bit int
fildl   tmp     ; convert int back to float
fstpl   (%ebx)  ; save rounded fp

But the trick beats both FPU methods. Sample program round2.c has been compiled with following options:

gcc -march=core2 -O3 -Wall -pedantic -std=c99 round.c -o test

Then program was run on C2D E8200 @ 2.6GHz:

$ time ./test 0 10000
stdlib, iterations = 10000, size = 65536

real    0m13.344s
user    0m13.333s
sys     0m0.000s

$ time ./test 1 10000
FPU FRNDINT, iterations = 10000, size = 65536

real    0m15.748s
user    0m15.733s
sys     0m0.000s

$ time ./test 2 10000
FPU FISTP/FILD, iterations = 10000, size = 65536

real    0m1.280s
user    0m1.276s
sys     0m0.004s

$ time ./test 3 10000
simple method (C impl.), iterations = 10000, size = 65536

real    0m0.456s
user    0m0.452s
sys     0m0.000s

SSE: converting 64-bit to double [21.06.2008]

On Intel forum user pvercello asked about SSE-assisted conversion of signed int 64-bit (less than 252) to double. I have proposed a simple algorithm to convert unsigned ints, that use trick similar to these presented above. Then pvercello shown better approach, that need fewer operations. Finally I found a simple, branch-less way to do a singed conversion, unfortunately slower than FPU conversion.

Below is unsigned int to float algorithm outline. Please follow the discussion to find details about signed conversion. The idea is based on the same property of floats that is use in float -> int conversion.

  1. Assure, that the number is less than 252 (or 223 for floats):

    if (x < (1ll << 52))
            return ERANGE;
  2. Set the exponent of floating point number to 52 — value of such double is 252 + x:

    uint64_t x_double = (1ll << 52) | x; // bitwise or
  3. Last step is to subtract floating point value 252:

    double X = (*(double*)&x_double) - (double)(1ll << 52);

Sample program convert_int52_double.c is available.