Author: | Wojciech Muła |
---|---|
Added on: | 15.06.2008 |
Floating point numbers store at most 23 (32-bit float) or 52 (64-bit double) bits of mantissa; the most significant bit is always set (if number is normalized), thus 24 or 53 bits are known, but MSB bit isn't stored.
Note: methods presented here produce correct results only if certain conditions are met — they are not general, don't work for NaNs, denormalized numbers, etc.
Few years ago I've developed a method that do not need any floating-point operations — description is written in Polish, but sample code should be easy to understand. In short words mantissa is completed with the implicit bit 23 (or 52) and treated as a natural number. Then this number is shifted left or right to place the dot position at 0 — the shift amount depends on the exponent value.
Another method uses floating point operations and is limited to positive number less than 223 (float) (and 252 for doubles).
When value 223 is added to another float, then just 23 most significant bits are stored — the fraction bits are shifted out.
Let see an example, number 7.25 (111.01) has following floating point representation:
+-+--------+-----------------------+ |0|10000001|11010000000000000000000| +-+--------+-----------------------+ S exp+127 normalized mantissa
After adding 223:
+-+--------+-----------------------+ |0|10010110|00000000000000000000111| +-+--------+-----------------------+ S exp+127 normalized mantissa
Mantissa field treated as natural number contains an integer part of number.
Because addition is used, then the result is rounded or truncated, depending on the current FPU's rounding settings. When bare bit shift is used instead of addition (as in the method mentioned earlier), then the number is always truncated.
Note: this method could be used to get fixed point, just smaller value is needed: 223 − fractionbits, but this also limit the maximum value of float.
Implementation from sample program float2int.c:
void convert_simple() { double C = (1ll << 52); union { double val; int64_t bin; } tmp; int i; for (i=0; i < SIZE; i++) { tmp.val = in[i] + C; tmp.bin = tmp.bin & 0x000fffffffffffffll; out_2[i] = tmp.bin; } }
However this method is slower than ordinal FPU instructions, i.e.:
fldl (%eax) fistpl (%ebx) (or fisttpl (%ebx) on CPU with SSSE3)
Very similar method could be used to perform float rounding. First value 223 is added, resulting in fraction bits lost, then same value is subtracted.
There is FPU instruction FRNDINT that does the job, but is very slow:
fldl (%eax) frndint fstpl (%ebx)
Faster is another FPU code:
fldl (%eax) ; load fp fistpl tmp ; convert to 64-bit int fildl tmp ; convert int back to float fstpl (%ebx) ; save rounded fp
But the trick beats both FPU methods. Sample program round2.c has been compiled with following options:
gcc -march=core2 -O3 -Wall -pedantic -std=c99 round.c -o test
Then program was run on C2D E8200 @ 2.6GHz:
$ time ./test 0 10000 stdlib, iterations = 10000, size = 65536 real 0m13.344s user 0m13.333s sys 0m0.000s $ time ./test 1 10000 FPU FRNDINT, iterations = 10000, size = 65536 real 0m15.748s user 0m15.733s sys 0m0.000s $ time ./test 2 10000 FPU FISTP/FILD, iterations = 10000, size = 65536 real 0m1.280s user 0m1.276s sys 0m0.004s $ time ./test 3 10000 simple method (C impl.), iterations = 10000, size = 65536 real 0m0.456s user 0m0.452s sys 0m0.000s
On Intel forum user pvercello asked about SSE-assisted conversion of signed int 64-bit (less than 252) to double. I have proposed a simple algorithm to convert unsigned ints, that use trick similar to these presented above. Then pvercello shown better approach, that need fewer operations. Finally I found a simple, branch-less way to do a singed conversion, unfortunately slower than FPU conversion.
Below is unsigned int to float algorithm outline. Please follow the discussion to find details about signed conversion. The idea is based on the same property of floats that is use in float -> int conversion.
Assure, that the number is less than 252 (or 223 for floats):
if (x < (1ll << 52)) return ERANGE;
Set the exponent of floating point number to 52 — value of such double is 252 + x:
uint64_t x_double = (1ll << 52) | x; // bitwise or
Last step is to subtract floating point value 252:
double X = (*(double*)&x_double) - (double)(1ll << 52);
Sample program convert_int52_double.c is available.