Author: | Wojciech Muła |
---|---|
Added on: | 2014-01-26 |
Updated on: | 2018-11-26 (measurements from Skylake and SkylakeX) |

SSE provides a little-known control register called **MXCSR**. This
register plays three roles:

- Controls calculations:
  - flag **flush to zero**,
  - flag **denormals are zeros**,
  - rounding mode (not covered in this text).
- Allows masking and unmasking of floating-point exceptions.
- Stores information about floating-point errors. These flags are sticky, i.e. the programmer is responsible for clearing them.
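As a minimal sketch of how the control flags can be reached from C, the snippet below uses the `_mm_getcsr` and `_mm_setcsr` intrinsics and the flush-to-zero macros from `<xmmintrin.h>`. The names `enable_ftz_daz`, `restore_mxcsr` and `MXCSR_DAZ` are illustrative and do not come from the original text; the "denormals are zeros" bit is written as a raw constant here, although `<pmmintrin.h>` also offers `_MM_SET_DENORMALS_ZERO_MODE` for the same purpose.

```c
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr, _MM_SET_FLUSH_ZERO_MODE */

#define MXCSR_DAZ 0x0040u  /* bit 6: "denormals are zeros" (illustrative name) */

/* Turn on "flush to zero" and "denormals are zeros" for the calling
   thread. Returns the previous MXCSR value so it can be restored. */
static unsigned int enable_ftz_daz(void) {
    unsigned int saved = _mm_getcsr();
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);  /* sets bit 15 */
    _mm_setcsr(_mm_getcsr() | MXCSR_DAZ);        /* sets bit 6 */
    return saved;
}

/* Write back an MXCSR value previously returned by enable_ftz_daz(). */
static void restore_mxcsr(unsigned int saved) {
    _mm_setcsr(saved);
}
```

MXCSR is per-thread state, so a sketch like this would have to be called in every thread that runs the affected code.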

Possible errors in SSE floating point calculations are:

- division by zero,
- underflow,
- overflow,
- operations on denormalized values,
- invalid operations, like the square root of a negative number or division of zero by zero.

By default all floating-point exceptions in SSE are masked, i.e. errors are
not converted into hardware exceptions. When an exception is unmasked and the
corresponding error occurs, the standard `SIGFPE` signal is raised.
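Unmasking a single exception could look roughly like the following. This is only a sketch: it relies on the `_MM_GET_EXCEPTION_MASK`, `_MM_SET_EXCEPTION_MASK` and `_MM_MASK_DIV_ZERO` macros from `<xmmintrin.h>`, and the function names are made up for illustration.

```c
#include <xmmintrin.h>  /* _MM_GET_EXCEPTION_MASK, _MM_SET_EXCEPTION_MASK */

/* Unmask the "division by zero" exception for the calling thread.
   Returns the previous mask so the caller can restore it. */
static unsigned int unmask_div_zero(void) {
    unsigned int old_mask = _MM_GET_EXCEPTION_MASK();
    _MM_SET_EXCEPTION_MASK(old_mask & ~(unsigned int)_MM_MASK_DIV_ZERO);
    return old_mask;
}

/* Restore a mask previously returned by unmask_div_zero(). */
static void restore_exception_mask(unsigned int old_mask) {
    _MM_SET_EXCEPTION_MASK(old_mask);
}
```

Between the two calls, an SSE division by zero would no longer be handled silently but reported as `SIGFPE`, so the program needs a handler or should expect to be terminated.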

**Important**: even if errors are masked, calculations slow down
significantly when an erroneous situation occurs. So if a program
slows down for no apparent reason, the cause may be an error in SSE-related
code, for example loading "random" values into XMM registers.

Error flags in the MXCSR are always updated, regardless of which exceptions are reported.
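The sticky behaviour can be observed with a small experiment. The sketch below assumes the `_MM_GET_EXCEPTION_STATE`, `_MM_SET_EXCEPTION_STATE` and `_MM_EXCEPT_DIV_ZERO` macros from `<xmmintrin.h>`; the helper names are illustrative only.

```c
#include <xmmintrin.h>  /* _mm_div_ss, _MM_GET_EXCEPTION_STATE, ... */

/* Divide 1.0f by 0.0f with a scalar SSE instruction. The volatile
   variables keep the compiler from folding the division away. */
static float divide_by_zero(void) {
    volatile float one = 1.0f, zero = 0.0f;
    volatile float result =
        _mm_cvtss_f32(_mm_div_ss(_mm_set_ss(one), _mm_set_ss(zero)));
    return result;
}

/* Returns non-zero if the sticky "division by zero" flag is set. */
static int div_zero_flag_set(void) {
    return (_MM_GET_EXCEPTION_STATE() & _MM_EXCEPT_DIV_ZERO) != 0;
}

/* Clears all sticky error flags in MXCSR. */
static void clear_error_flags(void) {
    _MM_SET_EXCEPTION_STATE(0);
}
```

After `divide_by_zero()` the flag stays set across any number of reads and disappears only after `clear_error_flags()`.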

The flag "flush to zero" forces a zero result on **underflow** or **denormal**
errors and, more importantly, these errors then have **no impact** on
calculation speed.

For example, underflow occurs in the sample loop below, because we multiply
`FLT_MIN` by `FLT_MIN` (`FLT_MIN` = 2^{−126}), and the result
can't be represented in floating point.

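    #include <float.h>   /* FLT_MIN */
    #include <stdint.h>  /* int32_t, uint32_t */

    /* packed_float is not defined in the text; assumed here to
       replicate one value into all four lanes */
    #define packed_float(x) {x, x, x, x}

    float min_floats[4] = packed_float(FLT_MIN);

    void mulps_in_loop() {
        const int32_t iterations = 10000000;
        uint32_t dummy;

        __asm__ __volatile__(
            "movups min_floats, %%xmm0\n"
            "1:\n"
            "movaps %%xmm0, %%xmm1\n"
            "mulps  %%xmm1, %%xmm1\n"   // FLT_MIN * FLT_MIN => underflow
            "loop   1b\n"               // counts the iterations down in ecx
            : "=c" (dummy)
            : "c" (iterations)
            : "xmm0", "xmm1"            // registers modified by the block
        );
    }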

Completion time of the above loop, in seconds:

architecture | default settings | flush to zero | speed-up |
---|---|---|---|
Core2 | 0.796 | 0.023 | 30 x |
Skylake | 0.019 | 0.019 | --- |
SkylakeX | 0.028 | 0.022 | 1.25 x |

A denormalized floating point number is a very small number with the value
(0 + *fraction*) ⋅ 2^{−126}. Such a value appears, for example, when we
divide `FLT_MIN` by 2.
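In the single-precision bit layout this corresponds to an exponent field of zero combined with a non-zero fraction field. A possible check, written as a sketch with the illustrative name `is_denormal`, inspects the bit pattern directly and so performs no floating-point operation on the value:

```c
#include <float.h>   /* FLT_MIN */
#include <stdint.h>  /* uint32_t */
#include <string.h>  /* memcpy */

/* Returns 1 if x is denormalized: the 8-bit exponent field is 0 and
   the 23-bit fraction field is non-zero. */
static int is_denormal(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);           /* reinterpret without aliasing issues */
    uint32_t exponent = (bits >> 23) & 0xffu;
    uint32_t fraction = bits & 0x007fffffu;
    return exponent == 0 && fraction != 0;
}
```

With this helper, `is_denormal(FLT_MIN)` is 0 while `is_denormal(FLT_MIN / 2)` is 1.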

There is a catch: if the result of an operation on normalized
numbers is a denormalized value, **it's not an SSE error**. A denormal error is
reported only when one of the operands is **already denormalized**.

So, where is the problem? If a result is denormalized, speed is noticeably degraded, but we can't detect the point where the denormalization occurred. Detection is possible only when the denormalized value is used in a subsequent calculation.
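This behaviour can be checked by reading the sticky "denormal" flag after each step. The following is a sketch under the assumption that `_MM_EXCEPT_DENORM` and `_MM_GET_EXCEPTION_STATE` from `<xmmintrin.h>` are available; `sse_mul`, `sse_add` and `denormal_flag_set` are illustrative names.

```c
#include <float.h>      /* FLT_MIN */
#include <xmmintrin.h>  /* _mm_mul_ss, _mm_add_ss, _MM_GET_EXCEPTION_STATE */

/* Multiply two floats with a scalar SSE instruction. The volatile
   variables keep the compiler from evaluating it at compile time. */
static float sse_mul(float a, float b) {
    volatile float va = a, vb = b;
    volatile float r = _mm_cvtss_f32(_mm_mul_ss(_mm_set_ss(va), _mm_set_ss(vb)));
    return r;
}

/* Add two floats with a scalar SSE instruction. */
static float sse_add(float a, float b) {
    volatile float va = a, vb = b;
    volatile float r = _mm_cvtss_f32(_mm_add_ss(_mm_set_ss(va), _mm_set_ss(vb)));
    return r;
}

/* Returns non-zero if the sticky "denormal operand" flag is set. */
static int denormal_flag_set(void) {
    return (_MM_GET_EXCEPTION_STATE() & _MM_EXCEPT_DENORM) != 0;
}
```

With default settings and cleared flags, `sse_mul(FLT_MIN, 0.5f)` leaves the flag clear, and only a following `sse_add` on that result sets it. Note that an SSE comparison of a denormalized value also counts as an operation on a denormalized operand, so the flag should be read before the value is compared with anything.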

**MXCSR** has the flag "denormals are zeros", which makes an operation treat
every denormalized operand as 0, but **does not prevent** an operation on
normalized values from producing a denormalized result.

Let's summarize this with the following program:

- first `FLT_MIN` is multiplied by 0.5, resulting in a denormalized value;
- then this value is added to 0.

    float tiny_value[4]    = packed_float(FLT_MIN);
    float large_divisor[4] = packed_float(0.5);
    float final_value[4];

    void test_loop() {
        const int32_t iterations = 10000000;
        uint32_t dummy;

        __asm__ __volatile__(
            "1:\n"
            "movups tiny_value, %%xmm0\n"
            "movups large_divisor, %%xmm1\n"
            "pxor   %%xmm2, %%xmm2\n"
            "mulps  %%xmm1, %%xmm0\n"   // FLT_MIN * 0.5 => denormalized number
            "addps  %%xmm2, %%xmm0\n"   // denormalized + 0.0 => denormal exception
            "loop   1b\n"
            "movups %%xmm0, final_value\n"
            : "=c" (dummy)
            : "c" (iterations)
            : "xmm0", "xmm1", "xmm2", "memory"  // modified registers, store to final_value
        );
    }

- With default settings the final value is denormalized (5.877472e-39).
- With the flag "denormals are zeros" the final value is zero.
- With the flag "flush to zero" the final value is also zero.
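The three outcomes listed above can also be reproduced without inline assembly. The sketch below uses scalar intrinsics instead of the packed instructions of the test program, and the names `mul_then_add`, `MXCSR_DAZ` and `MXCSR_FTZ` are illustrative; it assumes that both flags are cleared when the program starts.

```c
#include <float.h>      /* FLT_MIN */
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr, _mm_mul_ss, _mm_add_ss */

#define MXCSR_DAZ 0x0040u  /* "denormals are zeros" bit */
#define MXCSR_FTZ 0x8000u  /* "flush to zero" bit */

/* Computes (FLT_MIN * 0.5) + 0.0 with the given extra MXCSR bits set,
   then restores the previous MXCSR value. */
static float mul_then_add(unsigned int extra_bits) {
    volatile float tiny = FLT_MIN, half = 0.5f, zero = 0.0f;
    volatile float result;
    unsigned int saved = _mm_getcsr();

    _mm_setcsr((saved & ~(MXCSR_DAZ | MXCSR_FTZ)) | extra_bits);
    /* normalized operands; the product is denormalized unless flushed */
    __m128 x = _mm_mul_ss(_mm_set_ss(tiny), _mm_set_ss(half));
    /* the first operand is whatever the multiplication produced */
    x = _mm_add_ss(x, _mm_set_ss(zero));
    result = _mm_cvtss_f32(x);
    _mm_setcsr(saved);
    return result;
}
```

Calling `mul_then_add(0)` should return the denormalized value `FLT_MIN / 2`, while `mul_then_add(MXCSR_DAZ)` and `mul_then_add(MXCSR_FTZ)` should both return zero.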

Completion time in seconds (speed-up relative to default settings in parentheses):

architecture | default settings | denormals are zeros | flush to zero |
---|---|---|---|
Core2 | 1.841 | 0.858 (2x) | 0.121 (15x) |
Skylake | 0.440 | 0.400 | 0.033 (13x) |
SkylakeX | 0.524 | 0.526 | 0.039 (13x) |

The test programs are available.