Author: | Wojciech Muła |
---|---|
Added on: | 2014-01-26 |
Updated on: | 2018-11-26 (measurements from Skylake and SkylakeX) |
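SSE provides a not widely known control register called MXCSR. This register plays three roles: it controls how calculations are performed (rounding mode, "flush to zero", "denormals are zeros"), it masks or unmasks exceptions, and it records which errors have occurred.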
Possible errors in SSE floating point calculations are: invalid operation, division by zero, denormal operand, overflow, underflow, and loss of precision.
By default all errors in SSE are masked, i.e. they are not converted into hardware exceptions. When an exception is unmasked and the corresponding error occurs, the standard SIGFPE signal is raised.
Important: even when errors are masked, an erroneous situation slows calculations down significantly. So if a program slows down for no apparent reason, the cause may be an error in SSE-related code, for example loading "random" values into XMM registers.
Error flags in the MXCSR are always updated, regardless of which exceptions are reported.
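The always-updated flags can be observed from C with the `_mm_getcsr` and `_mm_setcsr` intrinsics from `<xmmintrin.h>`. The following is a minimal sketch, not part of the original test programs; the helper name `underflow_flag_after_square` and the macro names are made up for this illustration. It clears the six flag bits (bits 0-5 of MXCSR), squares a value while all exceptions are still masked, and reports whether the underflow flag (bit 4) was set.

```c
#include <float.h>
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr, _mm_mul_ss */

/* MXCSR bits 0-5 are sticky error flags (names below are illustrative). */
#define MXCSR_FLAGS_MASK     0x003Fu
#define MXCSR_UNDERFLOW_FLAG 0x0010u  /* underflow flag, bit 4 */

/* Hypothetical helper: squares x with an SSE multiply and returns 1 if
   the underflow flag was raised by that multiplication, 0 otherwise. */
static int underflow_flag_after_square(float x)
{
    volatile float input = x;   /* volatile keeps the multiply at run time */
    volatile float result;

    _mm_setcsr(_mm_getcsr() & ~MXCSR_FLAGS_MASK);  /* clear sticky flags */

    __m128 v = _mm_set_ss(input);
    result = _mm_cvtss_f32(_mm_mul_ss(v, v));
    (void)result;

    return (_mm_getcsr() & MXCSR_UNDERFLOW_FLAG) != 0;
}
```

Calling `underflow_flag_after_square(FLT_MIN)` should report the flag even though no signal is delivered, while squaring `1.0f` should leave it clear.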
The "flush to zero" flag forces the result to 0 when an underflow occurs, i.e. when the result would otherwise be a denormalized number. More importantly, with this flag set such errors no longer slow calculations down.
For example, underflow occurs in the sample loop below, because we multiply FLT_MIN by FLT_MIN (FLT_MIN = $2^{-126}$): the result, $2^{-252}$, is far below the smallest denormalized value ($2^{-149}$) and can't be represented in single precision.

    float min_floats[4] = packed_float(FLT_MIN);

    void mulps_in_loop() {
        const int32_t iterations = 10000000;
        uint32_t dummy;

        __asm__ __volatile__(
            "movups min_floats, %%xmm0\n"
            "1:\n"
            "movaps %%xmm0, %%xmm1\n"
            "mulps  %%xmm1, %%xmm1\n"   // FLT_MIN * FLT_MIN => underflow
            "loop   1b\n"
            : "=c" (dummy)
            : "c" (iterations)
            : "xmm0", "xmm1"            // registers modified by the loop
        );
    }

Completion time of the above loop, in seconds:

architecture | default settings | flush to zero | speed-up |
---|---|---|---|
Core2 | 0.796 | 0.023 | 30x |
Skylake | 0.019 | 0.019 | --- |
SkylakeX | 0.028 | 0.022 | 1.25x |
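From C, "flush to zero" can be switched with the `_MM_SET_FLUSH_ZERO_MODE` macro from `<xmmintrin.h>`. The sketch below only illustrates the effect on a single value and says nothing about timing; the helper name `half_with_ftz` is hypothetical. It halves a number with an SSE multiply under the requested mode and then restores the previous MXCSR value.

```c
#include <float.h>
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr, _MM_SET_FLUSH_ZERO_MODE */

/* Hypothetical helper: returns x * 0.5 computed with an SSE multiply,
   with "flush to zero" enabled or disabled for that one operation. */
static float half_with_ftz(float x, int flush_to_zero)
{
    volatile float input = x;   /* volatile keeps the multiply at run time */
    volatile float result;
    const unsigned int saved = _mm_getcsr();

    _MM_SET_FLUSH_ZERO_MODE(flush_to_zero ? _MM_FLUSH_ZERO_ON
                                          : _MM_FLUSH_ZERO_OFF);
    result = _mm_cvtss_f32(_mm_mul_ss(_mm_set_ss(input), _mm_set_ss(0.5f)));
    _mm_setcsr(saved);          /* restore the caller's settings */

    return result;
}
```

With the flag off, halving FLT_MIN yields a tiny non-zero value; with the flag on, the same operation yields exactly 0. Results that stay in the normalized range, such as half of `1.0f`, are unaffected.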
A denormalized floating point number is a very small number of the form $(0 + \text{fraction}) \cdot 2^{-126}$. Such a value appears, for example, when we divide FLT_MIN by 2.
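The definition can be checked on the bit level: in the single precision format a denormalized number has an exponent field of 0 and a non-zero fraction field. The sketch below assumes the usual 32-bit IEEE 754 layout of `float` (1 sign bit, 8 exponent bits, 23 fraction bits); the helper names `float_bits` and `is_denormal` are hypothetical.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the raw 32-bit pattern of x. */
static uint32_t float_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);  /* well-defined way to reinterpret */
    return bits;
}

/* Hypothetical helper: 1 if x is denormalized, i.e. its exponent
   field is 0 and its fraction field is non-zero; 0 otherwise. */
static int is_denormal(float x)
{
    const uint32_t bits     = float_bits(x);
    const uint32_t exponent = (bits >> 23) & 0xFFu;
    const uint32_t fraction = bits & 0x007FFFFFu;

    return exponent == 0 && fraction != 0;
}
```

FLT_MIN has the pattern 0x00800000 (exponent field 1, fraction 0), while FLT_MIN / 2 has the pattern 0x00400000 (exponent field 0, fraction 0.5), which matches $(0 + 0.5) \cdot 2^{-126}$.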
There is a subtle problem: if an operation on normalized numbers produces a denormalized result, this is not an SSE error. The denormal error is reported only when one of the operands is already denormalized.
So, where is the problem? If a result is denormalized, speed is noticeably degraded, but we can't detect the point where the denormalization occurred. Detection is possible only when the denormalized value is used in a subsequent calculation.
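This asymmetry can be seen in the denormal flag of MXCSR (bit 1). The following sketch is an illustration under the same assumptions as any MXCSR experiment on an SSE-capable x86 CPU; the helper name `denormal_flag_after_mul` and the macro names are hypothetical. It clears the flags, performs one SSE multiplication, and reports whether the denormal flag was raised.

```c
#include <float.h>
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr, _mm_mul_ss */

#define MXCSR_FLAGS_MASK    0x003Fu  /* sticky error flags, bits 0-5 */
#define MXCSR_DENORMAL_FLAG 0x0002u  /* denormal operand flag, bit 1 */

/* Hypothetical helper: multiplies a by b with SSE and returns 1 if
   the denormal flag was raised by that multiplication, 0 otherwise. */
static int denormal_flag_after_mul(float a, float b)
{
    volatile float va = a, vb = b;  /* keep the multiply at run time */
    volatile float result;

    _mm_setcsr(_mm_getcsr() & ~MXCSR_FLAGS_MASK);  /* clear sticky flags */
    result = _mm_cvtss_f32(_mm_mul_ss(_mm_set_ss(va), _mm_set_ss(vb)));
    (void)result;

    return (_mm_getcsr() & MXCSR_DENORMAL_FLAG) != 0;
}
```

Multiplying FLT_MIN by 0.5 produces a denormalized result but leaves the flag clear, because both operands are normalized. Feeding that result into another multiplication, for example `denormal_flag_after_mul(FLT_MIN / 2.0f, 1.0f)`, raises the flag.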
MXCSR has the flag "denormals are zeros", which makes the CPU treat every denormalized operand as 0. It does not, however, prevent an operation on normalized values from producing a denormalized result.
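Both halves of that statement can be tried out with a small sketch. It assumes a CPU that supports the "denormals are zeros" bit (bit 6 of MXCSR); on a CPU without that support, setting the bit would fault. The macro `MXCSR_DAZ` and the helper `mul_with_daz` are hypothetical names introduced for this example.

```c
#include <float.h>
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr, _mm_mul_ss */

#define MXCSR_DAZ 0x0040u  /* "denormals are zeros", bit 6 */

/* Hypothetical helper: returns a * b computed with an SSE multiply,
   with "denormals are zeros" enabled or disabled for that operation. */
static float mul_with_daz(float a, float b, int daz)
{
    volatile float va = a, vb = b;  /* keep the multiply at run time */
    volatile float result;
    const unsigned int saved = _mm_getcsr();

    _mm_setcsr(daz ? (saved | MXCSR_DAZ) : (saved & ~MXCSR_DAZ));
    result = _mm_cvtss_f32(_mm_mul_ss(_mm_set_ss(va), _mm_set_ss(vb)));
    _mm_setcsr(saved);              /* restore the caller's settings */

    return result;
}
```

With the flag set, `mul_with_daz(FLT_MIN / 2.0f, 2.0f, 1)` returns 0 because the denormalized operand is read as 0, whereas `mul_with_daz(FLT_MIN, 0.5f, 1)` still returns a non-zero denormalized value, since both operands are normalized.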
Let us summarize this with the following program:

    float tiny_value[4]    = packed_float(FLT_MIN);
    float large_divisor[4] = packed_float(0.5);
    float final_value[4];

    void test_loop() {
        const int32_t iterations = 10000000;
        uint32_t dummy;

        __asm__ __volatile__(
            "1:\n"
            "movups tiny_value, %%xmm0\n"
            "movups large_divisor, %%xmm1\n"
            "pxor   %%xmm2, %%xmm2\n"
            "mulps  %%xmm1, %%xmm0\n"   // FLT_MIN * 0.5 => denormalized number
            "addps  %%xmm2, %%xmm0\n"   // denormalized + 0.0 => denormal exception
            "loop   1b\n"
            "movups %%xmm0, final_value\n"
            : "=c" (dummy)
            : "c" (iterations)
            : "xmm0", "xmm1", "xmm2", "memory"  // registers and memory modified
        );
    }

Completion time in seconds (speed-up relative to default settings in parentheses):

architecture | default settings | denormals are zeros | flush to zero |
---|---|---|---|
Core2 | 1.841 | 0.858 (2x) | 0.121 (15x) |
Skylake | 0.440 | 0.400 | 0.033 (13x) |
SkylakeX | 0.524 | 0.526 | 0.039 (13x) |
The test programs are available.